Managing strings - Python in easy steps (2014)

6

Managing strings

This chapter demonstrates how to work with string data values and text files in Python programs.

Manipulating strings

Formatting strings

Modifying strings

Converting strings

Accessing files

Reading and writing files

Updating file strings

Pickling data

Summary

Manipulating strings

String values can be manipulated in a Python program using the various operators listed in the table below:

Operator: | Description: | Example:
+         | Concatenate - join strings together | 'Hello' + 'Mike'
*         | Repeat - multiply the string | 'Hello' * 2
[ ]       | Slice - select a character at a specified index position | 'Hello' [0]
[ : ]     | Range Slice - select characters in a specified index range | 'Hello' [ 0 : 4 ]
in        | Membership Inclusive - return True if character exists in the string | 'H' in 'Hello'
not in    | Membership Exclusive - return True if character doesn’t exist in string | 'h' not in 'Hello'
r/R       | Raw String - suppress meaning of escape characters | print( r'\n' )
''' '''   | Docstring - describe a module, function, class, or method | def sum( a,b ) : ''' Add Args '''

The membership operators perform a case-sensitive match, so 'A' in 'abc' returns False.

The [ ] slice operator and [ : ] range slice operator treat a string as a sequence of individual characters, each of which can be referenced by its index number.

Similarly, the in and not in membership operators search through the string seeking to match the specified character, or a longer substring.
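As a minimal sketch of the slice and membership operators described above, the following uses the sample value 'Hello' from the table. The variable names are chosen here purely for illustration.

```python
# Sample value from the operator table; variable names are illustrative only.
word = 'Hello'

first = word[0]              # the character at index position 0
head = word[0:4]             # the characters at index positions 0 to 3
found_upper = 'H' in word    # case-sensitive match succeeds
found_lower = 'h' in word    # lowercase 'h' does not appear in 'Hello'
found_part = 'ell' in word   # membership also matches a longer substring

print(first, head, found_upper, found_lower, found_part)
```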

The raw string operator r (or uppercase R) must be placed immediately before the opening quote mark to suppress escape characters in the string and is useful when the string contains the backslash character.

A “docstring” is a descriptive string literal that occurs as the first statement in a module, a function, a class, or a method definition. This should be enclosed within triple quote marks - the examples in this book use triple single quote marks, although triple double quote marks are equally valid. Uniquely, the docstring becomes the __doc__ special attribute of that object, so it can be referenced using the object’s name and dot-suffixing. All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings.

The Range Slice returns the string up to, but not including, the final specified index position.

Start a new Python script by defining a simple function that includes a docstring description

def display( s ) :
    '''Display an argument value.'''
    print( s )

manipulate.py

Next, add a statement to display the function description

display( display.__doc__ )

Now, add a statement to display a raw string value that contains the backslash character

display( r'C:\Program Files' )

Then, add a statement to display a concatenation of two string values that include an escape character and a space

display( '\nHello' + ' Python' )

Next, add a statement to display a slice of a specified string within a range of element index numbers

display( 'Python In Easy Steps\n' [ 7 : ] )

Finally, display the results of seeking characters within a specified string

display( 'P' in 'Python' )
display( 'p' in 'Python' )

Save the file in your scripts directory then open a Command Prompt window there and run this program - to see manipulated strings get displayed

Remember that strings must be enclosed within either single quote marks or double quote marks.

With range slice, if the start index number is omitted, zero is assumed and if the end index number is omitted, the string length is assumed.
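The defaults for an omitted start or end index can be sketched as below, reusing the title string from the example script. The variable names are illustrative only.

```python
# Illustrative names; the string is the one sliced in the example script.
title = 'Python In Easy Steps'

start_omitted = title[:6]    # start defaults to index position 0
end_omitted = title[7:]      # end defaults to the string length
both_omitted = title[:]      # a copy of the whole string

print(start_omitted)
print(end_omitted)
print(both_omitted)
```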

Formatting strings

The Python built-in dir() function can be useful to examine the names of functions and variables defined in a module by specifying the module name within its parentheses. Interactive mode can easily be used for this purpose by importing the module then calling the dir() function. For example, the “dog” module created in the previous chapter can be examined in this way.

Notice that the __doc__ attribute introduced in the previous example appears listed here by the dir() function.

Those defined names that begin and end with a double underscore are Python objects, whereas the others are programmer-defined. The __builtins__ module can also be examined using the dir() function, to examine the names of functions and variables defined by default, such as the print() function and a str object.
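A minimal sketch of this kind of examination follows. It assumes the standard library math module as a stand-in for the “dog” module, which is not reproduced here, and it imports the builtins module, which the interpreter normally makes available under the name __builtins__.

```python
import builtins
import math   # assumption: stands in for the "dog" module of the previous chapter

math_names = dir(math)

# Separate the double-underscore names from the ordinary defined names.
special = [name for name in math_names
           if name.startswith('__') and name.endswith('__')]
others = [name for name in math_names if name not in special]

print('__doc__' in special)        # the docstring attribute is listed
print('sqrt' in others)            # an ordinary function name is listed
print('print' in dir(builtins))    # print() is defined by default
print('str' in dir(builtins))      # so is the str object
```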

The str object defines several useful methods for string formatting, including an actual format() method that performs replacements. A string to be formatted by the format() method can contain both text and “replacement fields” marking places where text is to be inserted from an ordered comma-separated list of values. Each replacement field is denoted by { } braces, which may, optionally, contain the index number position of the replacement in the list.

Strings may also be formatted using the C-style %s substitution operator to mark places in a string where text is to be inserted from a comma-separated ordered list of values.

Do not confuse the str object described here with the str() function that converts values to the string data type.

Start a new Python script by initializing a variable with a formatted string

snack = '{} and {}'.format( 'Burger' , 'Fries' )

format.py

Next, display the variable value to see the text replaced in their listed order

print( '\nReplaced:' , snack )

Now, assign a differently-formatted string to the variable

snack = '{1} and {0}'.format( 'Burger' , 'Fries' )

Then, display the variable value again to see the text now replaced by their specified index element value

print( 'Replaced:' , snack )

Assign another formatted string to the variable

snack = '%s and %s' % ( 'Milk' , 'Cookies' )

Finally, display the variable value once more to see the text substituted in their listed order

print( '\nSubstituted:' , snack )

Save the file in your scripts directory then open a Command Prompt window there and run this program - to see formatted strings get displayed

You cannot leave spaces around the index number in the replacement field.

Other data types can be substituted using %d for a decimal integer, %c for a character, and %f for a floating-point number.
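The other substitution types might be combined as in this short sketch. The values and variable names are invented for illustration.

```python
# Invented sample values, one for each substitution type.
name, points, grade, average = 'Mike', 42, 'A', 7.5

# %s string, %d decimal integer, %c single character, %f floating-point number
line = '%s scored %d points, grade %c, average %f' % (name, points, grade, average)
print(line)

# A precision can be given to limit the digits shown after the decimal point.
short = '%.2f' % average
print(short)
```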

Modifying strings

The Python str object has many useful methods that can be dot-suffixed to its name for modification of the string and to examine its contents. The most commonly used string modification methods are listed in the table below together with a brief description:

Method: | Description:
capitalize( ) | Change string’s first letter to uppercase
title( ) | Change all first letters to uppercase
upper( ), lower( ), swapcase( ) | Change the case of all letters to uppercase, to lowercase, or to the inverse of the current case respectively
join( seq ) | Join the elements of sequence seq, using the string as the separator between them
lstrip( ), rstrip( ), strip( ) | Remove leading whitespace, trailing whitespace, or both leading and trailing whitespace respectively
replace( old , new ) | Replace all occurrences of old with new
ljust( w , c ), rjust( w , c ) | Pad string to right or left respectively to total column width w with character c
center( w , c ) | Pad string each side to total column width w with character c ( default is space )
count( sub ) | Return the number of occurrences of sub
find( sub ) | Return the index number of the first occurrence of sub or return -1 if not found
startswith( sub ), endswith( sub ) | Return True if sub is found at start or end respectively - otherwise return False
isalpha( ), isnumeric( ), isalnum( ) | Return True if all characters are letters only, are numbers only, are letters or numbers only - otherwise return False
islower( ), isupper( ), istitle( ) | Return True if string characters are lowercase, uppercase, or all first letters are uppercase only - otherwise return False
isspace( ) | Return True if string contains only whitespace - otherwise return False
isdigit( ), isdecimal( ) | Return True if string contains only digits or only decimal characters - otherwise return False

A space character is not alphanumeric so isalnum() returns False when examining strings that contain spaces.
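Several of the examining methods listed in the table can be tried out as in the sketch below. The variable names and the extra sample values are illustrative assumptions.

```python
string = 'python in easy steps'

occurrences = string.count('s')          # how many times 's' appears
position = string.find('easy')           # index of the first match
missing = string.find('hard')            # -1 because there is no match
starts = string.startswith('python')
spaced = string.isalnum()                # False because of the spaces
unspaced = 'python3'.isalnum()           # letters and numbers only
trimmed = '  padded  '.strip()           # whitespace removed from both ends
joined = '-'.join(['a', 'b', 'c'])       # the string separates the elements

print(occurrences, position, missing, starts, spaced, unspaced, trimmed, joined)
```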

Start a new Python script by initializing a variable with a string of lowercase characters and spaces

string = 'python in easy steps'

modify.py

Next, display the string capitalized, titled, and centered

print( '\nCapitalized:\t' , string.capitalize() )
print( '\nTitled:\t\t' , string.title() )
print( '\nCentered:\t' , string.center( 30 , '*' ) )

Now, display the string in all uppercase and joined between a sequence of two asterisks

print( '\nUppercase:\t' , string.upper() )
print( '\nJoined:\t\t' , string.join( '**' ) )

Then, display the string padded with asterisks on the left

print( '\nJustified:\t' , string.rjust( 30 , '*' ) )

Finally, display the string with all occurrences of the 's' character replaced by asterisks

print( '\nReplaced:\t' , string.replace( 's' , '*' ) )

Save the file in your scripts directory then open a Command Prompt window there and run this program - to see modified strings get displayed

With the rjust() method a RIGHT-justified string gets padding added to its LEFT, and with the ljust() method a LEFT-justified string gets padding added to its RIGHT.
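The direction of the padding can be confirmed with a short sketch such as this one, in which the word and the column width are arbitrary choices.

```python
word = 'Python'   # arbitrary six-character sample
width = 10        # arbitrary total column width

right = word.rjust(width, '*')     # padding is added on the left
left = word.ljust(width, '*')      # padding is added on the right
middle = word.center(width, '*')   # padding is added on each side

print(right)
print(left)
print(middle)
```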

Converting strings

Before Python 3.0, string characters were stored by their ASCII numeric code values in the range 0-127, representing only unaccented Latin characters. For example, the lowercase letter ‘a’ is assigned 97 as its code value. Each byte of computer memory can, in fact, store values in the range 0-255 but this is still too limited to represent all accented characters and non-Latin characters. For example, accented characters used in Western Europe and the Cyrillic alphabet used for Russian cannot all be represented in the range 128-255 because there are more than 128 such characters. Recent versions of Python overcome this limitation by storing string characters as their Unicode code point value to represent all characters and alphabets in the numeric range 0-1,114,111. Characters that are above the ASCII range require more than one byte when encoded as UTF-8, such as the two bytes hexadecimal 0xC3 0xB6 for ‘ö’.

The term “ASCII” is an acronym for American Standard Code for Information Interchange.

The str object’s encode() method can be used to convert from the default Unicode string to a bytes object in a specified encoding, and the bytes object’s decode() method can be used to convert back to a Unicode string.

Python’s “unicodedata” module, usefully, provides a name() function that reveals the Unicode name of each character. Accented and non-Latin characters can be referenced by their Unicode name or by decoding their hexadecimal encoded byte values.
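The difference between a character’s code point and its encoded bytes can be sketched as follows. The sample word is written with a \u escape so that the snippet does not depend on how the script file itself is saved, and the built-in ord() function is used here as an addition of this sketch to report the code point.

```python
import unicodedata

s = 'R\u00f6d'                        # 'Röd', with the ö written as an escape

encoded = s.encode('utf-8')           # a bytes object
decoded = encoded.decode('utf-8')     # back to a str object

code_point = ord(s[1])                # the single code point of ö
name = unicodedata.name(s[1])         # its Unicode name

print(len(s), len(encoded))           # characters versus bytes
print(encoded)
print(hex(code_point), name)
```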

Start a new Python script by initializing a variable with a string containing a non-ASCII character then display its value, data type, and string length

s = 'Röd'
print( '\nRed String:' , s )
print( 'Type:' , type( s ) , '\tLength:' , len( s ) )

unicode.py

Next, encode the string and again display its value, data type, and length - to reveal the hexadecimal byte values of the non-ASCII character

s = s.encode( 'utf-8' )
print( '\nEncoded String:' , s )
print( 'Type:' , type( s ) , '\tLength:' , len( s ) )

Now, decode the bytes and once more display the value, data type, and string length

s = s.decode( 'utf-8' )
print( '\nDecoded String:' , s )
print( 'Type:' , type( s ) , '\tLength:' , len( s ) )

Then, add statements to make “unicodedata” features available and a loop to reveal the Unicode name of each character in the string

import unicodedata
for i in range( len( s ) ) :
    print( s[ i ] , unicodedata.name( s[ i ] ) , sep = ' : ' )

Next, add statements to assign the variable a new value that includes hexadecimal byte values for a non-ASCII character then display the decoded string value

s = b'Gr\xc3\xb6n'
print( '\nGreen String:' , s.decode( 'utf-8' ) )

Finally, add statements to assign the variable another new value that includes a Unicode character name for a non-ASCII character then display the string value

s = 'Gr\N{LATIN SMALL LETTER O WITH DIAERESIS}n'
print( 'Green String:' , s )

Save the file in your scripts directory then open a Command Prompt window there and run this program - to see converted strings and unicode character names

A string containing hexadecimal byte values must be immediately prefixed by a b to denote that string as a byte literal.

Unicode names are uppercase and referenced by inclusion between { } braces prefixed by a \N in this notation format.

Accessing files

The __builtins__ module can be examined using the dir() function to reveal that it contains an open() function. This returns a file object that defines several methods for working with files, including read(), write(), and close().

Before a file can be read or written it must first be opened using the open() function. This accepts two string arguments: one to specify the name and location of the file, and one of the following “mode” specifiers in which to open the file:

File mode: | Operation:
r          | Open an existing file to read
w          | Open a file to write. Creates a new file if none exists or opens an existing file and discards all its previous contents
a          | Append text. Opens or creates a text file for writing at the end of the file
r+         | Open an existing text file to read from or write to, keeping its previous contents
w+         | Open a text file to write to or read from, creating a new file if none exists and discarding any previous contents
a+         | Opens or creates a text file to read from or write to at the end of the file

Where the mode includes a b after any of the file modes listed above, the operation relates to a binary file rather than a text file. For example, rb or w+b.

File mode arguments are string values so must be surrounded by quotes.
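The effect of the w, a, and r modes can be sketched as below. The file name “modes.txt” and the use of the tempfile module to provide a scratch directory are assumptions made for this illustration only.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                  # scratch directory for this sketch
path = os.path.join(folder, 'modes.txt')     # hypothetical file name

file = open(path, 'w')       # no file exists yet, so one is created
file.write('first\n')
file.close()

file = open(path, 'a')       # writing continues at the end of the file
file.write('second\n')
file.close()

file = open(path, 'r')
after_append = file.read()
file.close()

file = open(path, 'w')       # the previous contents are discarded
file.write('third\n')
file.close()

file = open(path, 'r')
after_rewrite = file.read()
file.close()

shutil.rmtree(folder)        # tidy up the scratch directory
print(repr(after_append))
print(repr(after_rewrite))
```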

Once a file is opened and you have a file object, you can get various details related to that file from its properties and methods:

Property:   | Description:
name        | Name of the opened file
mode        | Mode in which the file was opened
closed      | Status boolean value of True or False
readable( ) | Read permission boolean value of True or False
writable( ) | Write permission boolean value of True or False

You can also use a readlines() method that returns a list of all lines.

Start a new Python script by creating a file object for a new text file named “example.txt” to write content into

file = open( 'example.txt' , 'w' )

access.py

Next, add statements to display the file name and mode

print( 'File Name:' , file.name )
print( 'File Open Mode:' , file.mode )

Now, add statements to display the file access permissions

print( 'Readable:' , file.readable() )
print( 'Writable:' , file.writable() )

Then, define a function to determine the file’s status

def get_status( f ) :
    if f.closed :
        return 'Closed'
    else :
        return 'Open'

Finally, add statements to display the current file status then close the file and display the file status once more

print( 'File Status:' , get_status( file ) )
file.close()
print( '\nFile Status:' , get_status( file ) )

Save the file in your scripts directory then open a Command Prompt window there and run this program - to see a file get opened for writing then get closed

If your program tries to open a non-existent file in r mode the interpreter will report an error.
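In Python 3 the error reported is a FileNotFoundError, which a program may choose to handle. The sketch below assumes a file name “missing.txt” inside a freshly created scratch directory, so the file is certain not to exist.

```python
import os
import tempfile

folder = tempfile.mkdtemp()                    # empty scratch directory
missing = os.path.join(folder, 'missing.txt')  # hypothetical file name

try:
    file = open(missing, 'r')      # r mode requires an existing file
except FileNotFoundError:
    outcome = 'not found'
else:
    outcome = 'opened'
    file.close()

os.rmdir(folder)                   # tidy up the scratch directory
print('Outcome:', outcome)
```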

Reading and writing files

Once a file has been successfully opened it can be read, or added to, or new text can be written in the file, depending on the mode specified in the call to the open() function. Following this, the open file must then always be closed by calling the close() method.

As you might expect, the read() method returns the entire content of the file and the write() method adds content to the file.

You can quickly and efficiently read the entire contents in a loop, iterating line by line.

Start a new Python script by initializing a variable with a concatenated string containing newline characters

poem = 'I never saw a man who looked\n'
poem += 'With such a wistful eye\n'
poem += 'Upon that little tent of blue\n'
poem += 'Which prisoners call the sky\n'

file.py

Next, add a statement to create a file object for a new text file named “poem.txt” to write content into

file = open( 'poem.txt' , 'w' )

Now, add statements to write the string contained in the variable into the text file, then close that file

file.write( poem )
file.close()

Then, add a statement to create a file object for the existing text file “poem.txt” to read from

file = open( 'poem.txt' , 'r' )

Now, add statements to display the contents of the text file, then close that file

for line in file :
    print( line , end = '' )
file.close()

Writing to an existing file opened in w mode will automatically overwrite its contents!

Save the file in your scripts directory then open a Command Prompt window there and run this program - to see the file get created then read out to display

Launch the Notepad text editor to confirm the new text file exists and reveal its contents written by the program

Now, add statements at the end of the program to append a citation to the text file then save the script file again

file = open( 'poem.txt' , 'a' )
file.write( '(Oscar Wilde)' )
file.close()

Run this program again to re-write the text file then view its contents in Notepad - to see the citation now appended after the original text content

Suppress the default newline provided by the print() function where the strings themselves contain newlines.

You can also use the file object’s readlines() method that returns a list of all lines in a file - one line per element.
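A minimal sketch of the readlines() method follows. It writes two lines of the poem to a scratch file first, using a file name and a temporary directory that are assumptions of this illustration.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                  # scratch directory for this sketch
path = os.path.join(folder, 'lines.txt')     # hypothetical file name

file = open(path, 'w')
file.write('I never saw a man who looked\n')
file.write('With such a wistful eye\n')
file.close()

file = open(path, 'r')
lines = file.readlines()     # one list element per line, newlines included
file.close()

shutil.rmtree(folder)        # tidy up the scratch directory
print(len(lines))
print(repr(lines[1]))
```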

Updating file strings

A file object’s read() method will, by default, read the entire contents of the file from the very beginning, at index position zero, to the very end - at the index position of the final character. Optionally, the read() method can accept an integer argument to specify how many characters it should read.

The position within the file, from which to read or at which to write, can be finely controlled using the file object’s seek() method. This accepts an integer argument specifying how many characters to move position as an offset from the start of the file.

The current position within a file can be discovered at any time by calling the file object’s tell() method to return an integer location.
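These three methods might be combined as in the sketch below, which reads a fixed number of characters, remembers the position reported by tell(), and later returns to it with seek(). The file name and sample text are assumptions made for illustration.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                    # scratch directory for this sketch
path = os.path.join(folder, 'position.txt')    # hypothetical file name

file = open(path, 'w+')        # write, then read back from the same object
file.write('Hello Python')

file.seek(0)                   # return to the start of the file
first = file.read(5)           # read only the first five characters
mark = file.tell()             # remember the current position
rest = file.read()             # read from there to the end

file.seek(mark)                # go back to the remembered position
again = file.read()            # the same text is read a second time
file.close()

shutil.rmtree(folder)          # tidy up the scratch directory
print(repr(first), repr(rest), repr(again))
```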

When working with file objects it is good practice to use the Python with keyword to group the file operational statements within a block. This technique ensures that the file is properly closed after operations end, even if an exception is raised on the way, and is much shorter than writing equivalent try finally blocks.
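The closing behavior can be sketched by deliberately raising an exception inside a with block. The ValueError, the file name, and the scratch directory are all assumptions of this illustration.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                   # scratch directory for this sketch
path = os.path.join(folder, 'guarded.txt')    # hypothetical file name

try:
    with open(path, 'w') as file:
        file.write('partial')
        raise ValueError('simulated failure')   # interrupts the block
except ValueError:
    pass

closed_after_error = file.closed    # the with block still closed the file

shutil.rmtree(folder)               # tidy up the scratch directory
print('File Now Closed?:', closed_after_error)
```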

Start a new Python script by assigning a string value to a variable containing text to be written in a file

text = 'The political slogan "Workers Of The World Unite!" is from The Communist Manifesto.'

update.py

Next, add statements to write the text string into a file and display the file’s current status in the “with” block

with open( 'update.txt' , 'w' ) as file :
    file.write( text )
    print( '\nFile Now Closed?:' , file.closed )

Now, add a non-indented statement after the “with” code block to display the file’s new status

print( 'File Now Closed?:' , file.closed )

Then, re-open the file and display its contents to confirm it now contains the entire text string

with open( 'update.txt' , 'r+' ) as file :
    text = file.read()
    print( '\nString:' , text )

Next, add indented statements to display the current file position, then reposition and display that new position

    print( '\nPosition In File Now:' , file.tell() )
    position = file.seek( 33 )
    print( 'Position In File Now:' , file.tell() )

Now, add an indented statement to overwrite the text from the current file position

    file.write( 'All Lands' )

Then, add indented statements to reposition in the file once more and overwrite the text from the new position

    file.seek( 59 )
    file.write( 'the tombstone of Karl Marx.' )

Finally, add indented statements to return to the start of the file and display its entire updated contents

    file.seek( 0 )
    text = file.read()
    print( '\nString:' , text )

Save the file to your scripts directory then open a Command Prompt window there and run this program - to see the file strings get updated

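The seek() method may, optionally, accept a second argument value of 0, 1, or 2 to move the specified offset from the start, current, or end position respectively - zero is the default start position. For a file opened in text mode only offsets from the start are fully supported, so open the file in binary mode to move a non-zero number of bytes from the current or end position.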
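The second argument can be sketched with a file opened in binary mode, where offsets from the current and end positions are permitted. The file name and the ten sample bytes are assumptions made for illustration.

```python
import os
import shutil
import tempfile

folder = tempfile.mkdtemp()                  # scratch directory for this sketch
path = os.path.join(folder, 'bytes.dat')     # hypothetical file name

file = open(path, 'w+b')       # binary mode for writing and reading
file.write(b'0123456789')

file.seek(2, 0)                # two bytes from the start
file.seek(3, 1)                # three more bytes from the current position
middle = file.read(1)          # the byte now at position five

file.seek(-3, 2)               # three bytes back from the end
tail = file.read()             # the final three bytes
file.close()

shutil.rmtree(folder)          # tidy up the scratch directory
print(middle, tail)
```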

As with strings, the first character in a file is at index position zero - not at index position one.

Pickling data

In Python, string data can easily be stored in text files using the techniques demonstrated in the previous examples. Other data types, such as numbers, lists, or dictionaries, could also be stored in text files but would require conversion to strings first. Restoring that stored data to their original data type on retrieval would require another conversion. An easier way to achieve data persistence of any data object is provided by the “pickle” module.

The process of “pickling” objects stores a byte stream representation of an object that can later be “unpickled” to its former state, and is a very common Python programming procedure.

An object can be converted for storage in a file by specifying the object and an open file object as arguments to the pickle module’s dump() function. It can later be restored from that file by specifying an open file object as the sole argument to the pickle module’s load() function.

Unless the storage file needs to be human-readable for some reason, it is more efficient to use a machine-readable binary file.

Where the program needs to check for the existence of a storage file, the “os” module provides a path object with an isfile() method that returns True if a file specified within its parentheses is found.
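The round trip described above can be sketched without user input as follows. The file name “demo.dat”, the list values, and the scratch directory are assumptions of this illustration.

```python
import os
import pickle
import shutil
import tempfile

folder = tempfile.mkdtemp()                  # scratch directory for this sketch
path = os.path.join(folder, 'demo.dat')      # hypothetical storage file

existed_before = os.path.isfile(path)        # nothing has been stored yet

data = ['Python', 'In Easy Steps']           # invented sample values
file = open(path, 'wb')                      # binary file for writing
pickle.dump(data, file)
file.close()

existed_after = os.path.isfile(path)         # the storage file now exists

file = open(path, 'rb')                      # binary file for reading
restored = pickle.load(file)
file.close()

shutil.rmtree(folder)                        # tidy up the scratch directory
print(existed_before, existed_after, restored)
```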

Start a new Python script by making “pickle” and “os” module methods available

import pickle , os

data.py

Next, add a statement to test that a specific data file does not already exist

if not os.path.isfile( 'pickle.dat' ) :

Now, add a statement to create a list of two elements if the specified file is not found

    data = [ 0 , 1 ]

Then, add statements to request user data to be assigned to each of the list elements

    data[ 0 ] = input( 'Enter Topic: ' )
    data[ 1 ] = input( 'Enter Series: ' )

Next, add a statement to create a binary file for writing to

    file = open( 'pickle.dat' , 'wb' )

Now, add a statement to dump the values contained in the variables as data into the binary file

    pickle.dump( data , file )

Then, after writing the file remember to close it

    file.close()

Next, add alternative statements to open an existing file to read from if a specific data file does already exist

else :
    file = open( 'pickle.dat' , 'rb' )

Now, add statements to load the data stored in that existing file into a variable then close the file

    data = pickle.load( file )
    file.close()

Finally, add a statement to display the restored data

    print( 'Welcome Back To:' + data[0] + ',' + data[1] )

Save the file in your scripts directory then open a Command Prompt window there and run this program - to see user input get stored in a file then get retrieved

Pickling is the standard way to create Python objects that can be used in other programs.

Although this example just stores two string values in a list, pickling can store almost any type of Python object.
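To illustrate the point, this sketch pickles a dictionary holding a string, a list, a floating-point number, and a tuple, then confirms the restored copy keeps the original data types. The dictionary contents and the file name are invented for illustration.

```python
import os
import pickle
import shutil
import tempfile

folder = tempfile.mkdtemp()                   # scratch directory for this sketch
path = os.path.join(folder, 'record.dat')     # hypothetical storage file

# Invented record mixing several data types.
record = {'topic': 'Python', 'steps': [1, 2, 3], 'price': 9.99, 'pair': (True, None)}

with open(path, 'wb') as file:
    pickle.dump(record, file)

with open(path, 'rb') as file:
    restored = pickle.load(file)

shutil.rmtree(folder)                         # tidy up the scratch directory
print(restored == record)                     # same values
print(type(restored['pair']))                 # the tuple is still a tuple
```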

Summary

Strings can be manipulated by operators for concatenation +, selecting slices [ ], and membership with in and not in

The special __doc__ attribute can contain a “docstring” describing a module, function, class, or method

Python’s built-in dir() function can be useful to examine the names of functions and variables defined in a module

The __builtins__ module contains functions and variables that are available by default, such as the print() function

A str object has a format() method for string formatting and many methods for string modification, such as capitalize()

Unicode strings are used by default and can be converted to encoded bytes with the str object’s encode() method, then converted back with the decode() method

The unicodedata module provides a name() method that reveals the Unicode name of each character

A file object, returned by the open() function, has read(), write(), and close() methods for working with files, and features that describe the file properties

The open() function must specify a file name string argument and a file mode string argument, such as 'r' to read the file

Position in a file, at which to read or write, can be specified with the seek() method and reported by the tell() method

The Python with keyword groups file operational statements within a block and automatically closes an open file

The process of “pickling” objects stores a byte stream representation of an object that can later be “unpickled” to its former state

The pickle module’s dump() function requires arguments to specify an object for conversion and an open file object in which to store data

Stored object data can be retrieved by specifying the open file object in which it is stored to the pickle module’s load() function