Python in easy steps (2014)
6
Managing strings
This chapter demonstrates how to work with string data values and text files in Python programs.
Manipulating strings
Formatting strings
Modifying strings
Converting strings
Accessing files
Reading and writing files
Updating file strings
Pickling data
Summary
Manipulating strings
String values can be manipulated in a Python program using the various operators listed in the table below:
Operator | Description | Example
+ | Concatenate - join strings together | 'Hello' + 'Mike'
* | Repeat - multiply the string | 'Hello' * 2
[ ] | Slice - select a character at a specified index position | 'Hello' [0]
[ : ] | Range Slice - select characters in a specified index range | 'Hello' [ 0 : 4 ]
in | Membership Inclusive - return True if character exists in the string | 'H' in 'Hello'
not in | Membership Exclusive - return True if character doesn’t exist in string | 'h' not in 'Hello'
r/R | Raw String - suppress meaning of escape characters | print( r'\n' )
''' ''' | Docstring - describe a module, function, class, or method | def sum( a,b ) : ''' Add Args '''
The membership operators perform a case-sensitive match, so ‘A’ in ‘abc’ will fail.
The [ ] slice operator and [ : ] range slice operator recognize that a string is a sequence of individual characters, much like a list containing one character within each element, so each character can be referenced by its index number.
Similarly, the in and not in membership operators iterate through each element seeking to match the specified character.
The raw string operator r (or uppercase R) must be placed immediately before the opening quote mark to suppress escape characters in the string and is useful when the string contains the backslash character.
A “docstring” is a descriptive string literal that occurs as the first statement in a module, a function, a class, or a method definition. This should be enclosed within triple single quote marks. Uniquely, the docstring becomes the __doc__ special attribute of that object, so can be referenced using its name and dot-suffixing. All modules should normally have docstrings, and all functions and classes exported by a module should also have docstrings.
The Range Slice returns the string up to, but not including, the final specified index position.
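The operators described above can be tried together in a short sketch. The variable names here are illustrative only and are not part of the original example scripts.

```python
# Illustrative values only; these names are not from the original text.
greeting = 'Hello'

joined = greeting + ' ' + 'Mike'       # concatenation
repeated = greeting * 2                # repetition
first = greeting[0]                    # the single character at index 0
part = greeting[0:4]                   # indexes 0 to 3; index 4 is not included
has_upper_h = 'H' in greeting          # membership tests are case-sensitive
lacks_lower_h = 'h' not in greeting
raw = r'\n'                            # two characters: a backslash and the letter n

print(joined, repeated, first, part)
print(has_upper_h, lacks_lower_h, len(raw))
```

Note that the range slice stops one position short of its end index, so a slice of 0 : 4 contains four characters.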
Start a new Python script by defining a simple function that includes a docstring description
def display( s ) :
    '''Display an argument value.'''
    print( s )
manipulate.py
Next, add a statement to display the function description
display( display.__doc__ )
Now, add a statement to display a raw string value that contains the backslash character
display( r'C:\Program Files' )
Then, add a statement to display a concatenation of two string values that include an escape character and a space
display( '\nHello' + ' Python' )
Next, add a statement to display a slice of a specified string within a range of element index numbers
display( 'Python In Easy Steps\n' [ 7 : ] )
Finally, display the results of seeking characters within a specified string
display( 'P' in 'Python' )
display( 'p' in 'Python' )
Save the file in your scripts directory then open a Command Prompt window there and run this program - to see manipulated strings get displayed
Remember that strings must be enclosed within either single quote marks or double quote marks.
With range slice, if the start index number is omitted, zero is assumed and if the end index number is omitted, the string length is assumed.
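As a minimal sketch of the default start and end values for a range slice, the following uses a sample string similar to the one in the steps above. The variable names are assumptions made for illustration.

```python
title = 'Python In Easy Steps'

from_seven = title[7:]     # end omitted, so the slice runs to the end of the string
up_to_six = title[:6]      # start omitted, so the slice begins at index zero
whole = title[:]           # both omitted, so the slice is a copy of the whole string

print(from_seven)
print(up_to_six)
print(whole == title)
```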
Formatting strings
The Python built-in dir() function can be useful to examine the names of functions and variables defined in a module by specifying the module name within its parentheses. Interactive mode can easily be used for this purpose by importing the module name then calling the dir() function. The example below examines the “dog” module created in the previous chapter:
Notice that the __doc__ attribute introduced in the previous example appears listed here by the dir() function.
Those defined names that begin and end with a double underscore are special names supplied by Python, whereas the others are programmer-defined. The __builtins__ module can also be examined using the dir() function, to examine the names of functions and variables defined by default, such as the print() function and a str object.
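A short sketch of this kind of inspection follows. It assumes the standard library "builtins" module is imported by name, which is one way to reach the default names from a script, and the variable names are illustrative.

```python
import builtins

names = dir(builtins)                       # every name defined by default

# Separate the double-underscore names from the ordinary ones
special = [n for n in names if n.startswith('__') and n.endswith('__')]
ordinary = [n for n in names if not n.startswith('_')]

print('print' in ordinary, 'str' in ordinary)
print('__doc__' in special)
```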
The str object defines several useful methods for string formatting, including an actual format() method that performs replacements. A string to be formatted by the format() method can contain both text and “replacement fields” marking places where text is to be inserted from an ordered comma-separated list of values. Each replacement field is denoted by { } braces, which may, optionally, contain the index number position of the replacement in the list.
Strings may also be formatted using the C-style %s substitution operator to mark places in a string where text is to be inserted from a comma-separated ordered list of values.
Do not confuse the str object described here with the str() function that converts values to the string data type.
Start a new Python script by initializing a variable with a formatted string
snack = '{} and {}'.format( 'Burger' , 'Fries' )
format.py
Next, display the variable value to see the text replaced in their listed order
print( '\nReplaced:' , snack )
Now, assign a differently-formatted string to the variable
snack = '{1} and {0}'.format( 'Burger' , 'Fries' )
Then, display the variable value again to see the text now replaced by their specified index element value
print( 'Replaced:' , snack )
Assign another formatted string to the variable
snack = '%s and %s' % ( 'Milk' , 'Cookies' )
Finally, display the variable value once more to see the text substituted in their listed order
print( '\nSubstituted:' , snack )
Save the file in your scripts directory then open a Command Prompt window there and run this program - to see formatted strings get displayed
You cannot leave spaces around the index number in the replacement field.
Other data types can be substituted using %d for a decimal integer, %c for a character, and %f for a floating-point number.
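The other substitution codes can be sketched as below. The values and variable names are invented for illustration, and the ".2f" precision is an optional extra that limits the number of decimal places shown.

```python
item = 'Fries'
count = 3
price = 2.5

# %d takes an integer, %s a string, and %.2f a float shown to two decimal places
line = '%d x %s at %.2f each' % (count, item, price)

# %c takes a single character
initial = '%c' % 'F'

# An equivalent result using format() with index numbers in the replacement fields
same = '{0} x {1} at {2:.2f} each'.format(count, item, price)

print(line)
print(initial)
print(same == line)
```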
Modifying strings
The Python str object has many useful methods that can be dot-suffixed to its name for modification of the string and to examine its contents. The most commonly used string modification methods are listed in the table below together with a brief description:
Method | Description
capitalize( ) | Change string’s first letter to uppercase
title( ) | Change all first letters to uppercase
upper( ) lower( ) swapcase( ) | Change the case of all letters to uppercase, to lowercase, or to the inverse of the current case respectively
join( seq ) | Join the items of sequence seq, using the string as the separator between them
lstrip( ) rstrip( ) strip( ) | Remove leading whitespace, trailing whitespace, or both leading and trailing whitespace respectively
replace( old , new ) | Replace all occurrences of old with new
ljust( w , c ) rjust( w , c ) | Pad string to right or left respectively to total column width w with character c
center( w , c ) | Pad string each side to total column width w with character c ( default is space )
count( sub ) | Return the number of occurrences of sub
find( sub ) | Return the index number of the first occurrence of sub or return -1 if not found
startswith( sub ) endswith( sub ) | Return True if sub is found at start or end respectively - otherwise return False
isalpha( ) isnumeric( ) isalnum( ) | Return True if all characters are letters only, are numbers only, are letters or numbers only - otherwise return False
islower( ) isupper( ) istitle( ) | Return True if string characters are lowercase, uppercase, or all first letters are uppercase only - otherwise return False
isspace( ) | Return True if string contains only whitespace - otherwise return False
isdigit( ) isdecimal( ) | Return True if string contains only digits or decimals - otherwise return False
A space character is not alphanumeric so isalnum() returns False when examining strings that contain spaces.
Start a new Python script by initializing a variable with a string of lowercase characters and spaces
string = 'python in easy steps'
modify.py
Next, display the string capitalized, titled, and centered
print( '\nCapitalized:\t' , string.capitalize() )
print( '\nTitled:\t\t' , string.title() )
print( '\nCentered:\t' , string.center( 30 , '*' ) )
Now, display the string in all uppercase and merged with a sequence of two asterisks
print( '\nUppercase:\t' , string.upper() )
print( '\nJoined:\t\t' , string.join( '**' ) )
Then, display the string padded with asterisks on the left
print( '\nJustified:\t' , string.rjust( 30 , '*' ) )
Finally, display the string with all occurrences of the 's' character replaced by asterisks
print( '\nReplaced:\t' , string.replace( 's' , '*' ) )
Save the file in your scripts directory then open a Command Prompt window there and run this program - to see modified strings get displayed
With the rjust() method a RIGHT-justified string gets padding added to its LEFT, and with the ljust() method a LEFT-justified string gets padding added to its RIGHT.
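The padding and joining behavior can be compared side by side in a small sketch. The sample word and variable names are assumptions chosen to keep the results short.

```python
word = 'easy'

right = word.rjust(8, '*')       # padding is added on the LEFT
left = word.ljust(8, '*')        # padding is added on the RIGHT
middle = word.center(8, '*')     # padding is added on both sides

# join() places the string between each item of the sequence it is given
dashed = '-'.join(['in', 'easy', 'steps'])
wrapped = word.join('**')        # the sequence here is two single asterisks

# A space is not alphanumeric, so isalnum() fails when spaces are present
spaced = 'easy steps'.isalnum()

print(right, left, middle)
print(dashed, wrapped, spaced)
```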
Converting strings
Before Python 3.0, string characters were stored by their ASCII numeric code values in the range 0-127, representing only unaccented Latin characters. For example, the lowercase letter ‘a’ is assigned 97 as its code value. Each byte of computer memory can, in fact, store values in the range 0-255 but this is still too limited to represent all accented characters and non-Latin characters. For example, accented characters used in Western Europe and the Cyrillic alphabet used for Russian cannot all be represented in the range 128-255 because there are more than 128 such characters. Recent versions of Python overcome this limitation by storing string characters as their Unicode code point value to represent all characters and alphabets in the numeric range 0-1,114,111. Characters that are above the ASCII range require two or more bytes when encoded as UTF-8, such as hexadecimal 0xC3 0xB6 for ‘ö’.
The term “ASCII” is an acronym for American Standard Code for Information Interchange.
The str object’s encode() method can be used to convert a string from its default Unicode form into a bytes object in a specified encoding, such as UTF-8, and the bytes object’s decode() method can be used to convert back to a Unicode string.
Python’s “unicodedata” module, usefully, provides a name() function that reveals the Unicode name of each character. Accented and non-Latin characters can be referenced by their Unicode name or by decoding their hexadecimal UTF-8 byte values.
Start a new Python script by initializing a variable with a string containing a non-ASCII character then display its value, data type, and string length
s = 'Röd'
print( '\nRed String:' , s )
print( 'Type:' , type( s ) , '\tLength:' , len( s ) )
unicode.py
Next, encode the string and again display its value, data type, and string length - to reveal the hexadecimal byte values of the non-ASCII character
s = s.encode( 'utf-8' )
print( '\nEncoded String:' , s )
print( 'Type:' , type( s ) , '\tLength:' , len( s ) )
Now, decode the string and once more display its value, data type, and string length - to see the original string restored
s = s.decode( 'utf-8' )
print( '\nDecoded String:' , s )
print( 'Type:' , type( s ) , '\tLength:' , len( s ) )
Then, add statements to make “unicodedata” features available and a loop to reveal the Unicode name of each character in the string
import unicodedata
for i in range( len( s ) ) :
    print( s[ i ] , unicodedata.name( s[ i ] ) , sep = ' : ' )
Next, add statements to assign the variable a new value that includes hexadecimal byte values for a non-ASCII character then display the decoded string value
s = b'Gr\xc3\xb6n'
print( '\nGreen String:' , s.decode( 'utf-8' ) )
Finally, add statements to assign the variable another new value that includes a Unicode character name for a non-ASCII character then display the string value
s = 'Gr\N{LATIN SMALL LETTER O WITH DIAERESIS}n'
print( 'Green String:' , s )
Save the file in your scripts directory then open a Command Prompt window there and run this program - to see converted strings and unicode character names
A string containing byte values must be immediately prefixed by a b to denote that string as a bytes literal.
Unicode names are uppercase and referenced by inclusion between { } braces prefixed by a \N in this notation format.
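The difference between a character's code point and its encoded bytes can be seen in a brief sketch. The accented letter is written here as a \u escape so the listing contains only ASCII characters, and the variable names are illustrative.

```python
import unicodedata

word = 'R\u00f6d'                        # the same string as in the steps above
encoded = word.encode('utf-8')           # str to bytes
decoded = encoded.decode('utf-8')        # bytes back to str

code_point = ord(word[1])                # a single number for the accented letter
name = unicodedata.name(word[1])

print(len(word), 'characters,', len(encoded), 'bytes')
print(hex(code_point), name)
print(decoded == word)
```

The string has three characters but four bytes once encoded, because the accented letter occupies two bytes in UTF-8 while its code point is the single value 0xF6.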
Accessing files
The __builtins__ module can be examined using the dir() function to reveal that it contains an open() function. This returns a file object that defines several methods for working with files, including read(), write(), and close().
Before a file can be read or written it must first be opened using the open() function. This accepts two string arguments: one to specify the name and location of the file, and one of the following “mode” specifiers in which to open the file:
File mode | Operation
r | Open an existing file to read
w | Open a file to write. Creates a new file if none exists or opens an existing file and discards all its previous contents
a | Append text. Opens or creates a text file for writing at the end of the file
r+ | Open an existing text file to read from or write to
w+ | Open a text file to write to or read from, discarding any previous contents
a+ | Opens or creates a text file to read from or write to at the end of the file
Where the mode includes a b after any of the file modes listed above, the operation relates to a binary file rather than a text file. For example, rb or w+b.
File mode arguments are string values so must be surrounded by quotes.
Once a file is opened and you have a file object, you can get various details related to that file from its properties:
Property | Description
name | Name of the opened file
mode | Mode in which the file was opened
closed | Status boolean value of True or False
readable( ) | Read permission boolean value of True or False
writable( ) | Write permission boolean value of True or False
You can also use a readlines() method that returns a list of all lines.
Start a new Python script by creating a file object for a new text file named “example.txt” to write content into
file = open( 'example.txt' , 'w' )
access.py
Next, add statements to display the file name and mode
print( 'File Name:' , file.name )
print( 'File Open Mode:' , file.mode )
Now, add statements to display the file access permissions
print( 'Readable:' , file.readable() )
print( 'Writable:' , file.writable() )
Then, define a function to determine the file’s status
def get_status( f ) :
    if ( f.closed != False ) :
        return 'Closed'
    else :
        return 'Open'
Finally, add statements to display the current file status then close the file and display the file status once more
print( 'File Status:' , get_status( file ) )
file.close()
print( '\nFile Status:' , get_status( file ) )
Save the file in your scripts directory then open a Command Prompt window there and run this program - to see a file get opened for writing then get closed
If your program tries to open a non-existent file in r mode the interpreter will report an error.
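The error can be handled rather than left to stop the program. The sketch below assumes a hypothetical file path inside a fresh temporary directory, so that no file exists there when the read is attempted.

```python
import os
import tempfile

# Hypothetical path in a new temporary directory, so the file cannot exist yet
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

try:
    file = open(path, 'r')           # r mode needs an existing file
    outcome = 'opened'
    file.close()
except FileNotFoundError:
    outcome = 'not found'

file = open(path, 'w')               # w mode creates the missing file
details = (file.mode, file.readable(), file.writable(), file.closed)
file.close()

print('Read attempt:', outcome)
print('Mode, readable, writable, closed:', details)
print('Closed after close():', file.closed)
```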
Reading and writing files
Once a file has been successfully opened it can be read, or added to, or new text can be written in the file, depending on the mode specified in the call to the open() method. Following this, the open file must then always be closed by calling the close() method.
As you might expect, the read() method returns the entire content of the file and the write() method adds content to the file.
You can quickly and efficiently read the entire contents in a loop, iterating line by line.
Start a new Python script by initializing a variable with a concatenated string containing newline characters
poem = 'I never saw a man who looked\n'
poem += 'With such a wistful eye\n'
poem += 'Upon that little tent of blue\n'
poem += 'Which prisoners call the sky\n'
file.py
Next, add a statement to create a file object for a new text file named “poem.txt” to write content into
file = open( 'poem.txt' , 'w' )
Now, add statements to write the string contained in the variable into the text file, then close that file
file.write( poem )
file.close()
Then, add a statement to create a file object for the existing text file “poem.txt” to read from
file = open( 'poem.txt' , 'r' )
Now, add statements to display the contents of the text file, then close that file
for line in file :
    print( line , end = '' )
file.close()
Writing to an existing file opened in w mode will automatically overwrite its contents!
Save the file in your scripts directory then open a Command Prompt window there and run this program - to see the file get created then read out to display
Launch the Notepad text editor to confirm the new text file exists and reveal its contents written by the program
Now, add statements at the end of the program to append a citation to the text file then save the script file again
file = open( 'poem.txt' , 'a' )
file.write( '(Oscar Wilde)' )
file.close()
Run this program again to re-write the text file then view its contents in Notepad - to see the citation now appended after the original text content
Suppress the default newline provided by the print() function where the strings themselves contain newlines.
You can also use the file object’s readlines() method that returns a list of all lines in a file - one line per element.
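A minimal sketch of readlines() follows. It writes two lines of the poem to a hypothetical file in a temporary directory, so that it does not disturb the “poem.txt” file created in the steps above.

```python
import os
import tempfile

# Hypothetical file location used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')

file = open(path, 'w')
file.write('I never saw a man who looked\nWith such a wistful eye\n')
file.close()

file = open(path, 'r')
lines = file.readlines()             # one list element per line, newlines included
file.close()

print(len(lines))
for line in lines:
    print(line, end='')              # suppress the extra newline from print()
```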
Updating file strings
A file object’s read() method will, by default, read the entire contents of the file from the very beginning, at index position zero, to the very end - at the index position of the final character. Optionally, the read() method can accept an integer argument to specify how many characters it should read.
The position within the file, from which to read or at which to write, can be finely controlled using the file object’s seek() method. This accepts an integer argument specifying how many characters to move position as an offset from the start of the file.
The current position within a file can be discovered at any time by calling the file object’s tell() method to return an integer location.
When working with file objects it is good practice to use the Python with keyword to group the file operational statements within a block. This technique ensures that the file is properly closed after operations end, even if an exception is raised on the way, and is much shorter than writing equivalent try except blocks.
Start a new Python script by assigning a string value to a variable containing text to be written in a file
text = 'The political slogan "Workers Of The World Unite!" is from The Communist Manifesto.'
update.py
Next, add statements to write the text string into a file and display the file’s current status in the “with” block
with open( 'update.txt' , 'w' ) as file :
    file.write( text )
    print( '\nFile Now Closed?:' , file.closed )
Now, add a non-indented statement after the “with” code block to display the file’s new status
print( 'File Now Closed?:' , file.closed )
Then, re-open the file and display its contents to confirm it now contains the entire text string
with open( 'update.txt' , 'r+' ) as file :
    text = file.read()
    print( '\nString:' , text )
Next, add indented statements to display the current file position, then reposition and display that new position
    print( '\nPosition In File Now:' , file.tell() )
    position = file.seek( 33 )
    print( 'Position In File Now:' , file.tell() )
Now, add an indented statement to overwrite the text from the current file position
    file.write( 'All Lands' )
Then, add indented statements to reposition in the file once more and overwrite the text from the new position
    file.seek( 59 )
    file.write( 'the tombstone of Karl Marx.' )
Finally, add indented statements to return to the start of the file and display its entire updated contents
    file.seek( 0 )
    text = file.read()
    print( '\nString:' , text )
Save the file to your scripts directory then open a Command Prompt window there and run this program - to see the file strings get updated
The seek() method may, optionally, accept a second argument value of 0, 1, or 2 to move the specified number of characters from the start, current, or end position respectively - zero is the default start position.
As with strings, the first character in a file is at index position zero - not at index position one.
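Positioning from index zero can be sketched with a shorter string, as below. The file path is a hypothetical location in a temporary directory, and the offsets match character counts here only because the text is plain ASCII.

```python
import os
import tempfile

# Hypothetical file location used only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'slogan.txt')

with open(path, 'w') as file:
    file.write('Workers Of The World Unite!')

with open(path, 'r+') as file:
    file.seek(11)                    # the first character is at position zero
    file.write('All Lands')          # overwrites the nine characters 'The World'
    file.seek(0)                     # return to the start before reading
    updated = file.read()
    print('Position after reading:', file.tell())

print(updated)
print('Closed after the with block:', file.closed)
```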
Pickling data
In Python, string data can easily be stored in text files using the techniques demonstrated in the previous examples. Other data types, such as numbers, lists, or dictionaries, could also be stored in text files but would require conversion to strings first. Restoring that stored data to their original data type on retrieval would require another conversion. An easier way to achieve data persistence of any data object is provided by the “pickle” module.
The process of “pickling” objects stores a serialized representation of an object, as a stream of bytes, that can later be “unpickled” to its former state, and is a very common Python programming procedure.
An object can be converted for storage in a file by specifying the object and an open file object as arguments to the pickle module’s dump() function. It can later be restored from that file by specifying the open file object as the sole argument to the pickle module’s load() function.
Unless the storage file needs to be human-readable for some reason, it is more efficient to use a machine-readable binary file.
Where the program needs to check for the existence of a storage file, the “os” module provides a path object with an isfile() method that returns True if a file specified within its parentheses is found.
Start a new Python script by making “pickle” and “os” module methods available
import pickle , os
data.py
Next, add a statement to test that a specific data file does not already exist
if not os.path.isfile( 'pickle.dat' ) :
Now, add a statement to create a list of two elements if the specified file is not found
    data = [ 0 , 1 ]
Then, add statements to request user data to be assigned to each of the list elements
    data[ 0 ] = input( 'Enter Topic: ' )
    data[ 1 ] = input( 'Enter Series: ' )
Next, add a statement to create a binary file for writing to
    file = open( 'pickle.dat' , 'wb' )
Now, add a statement to dump the values contained in the variables as data into the binary file
    pickle.dump( data , file )
Then, after writing the file remember to close it
    file.close()
Next, add alternative statements to open an existing file to read from if a specific data file does already exist
else :
    file = open( 'pickle.dat' , 'rb' )
Now, add statements to load the data stored in that existing file into a variable then close the file
    data = pickle.load( file )
    file.close()
Finally, add a statement to display the restored data
    print( 'Welcome Back To:' + data[0] + ',' + data[1] )
Save the file in your scripts directory then open a Command Prompt window there and run this program - to see user input get stored in a file then get retrieved
Pickling is the standard way to store Python objects so that they can be used by other Python programs.
Although this example just stores two string values in a list, pickling can store almost any type of Python object.
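As a sketch of pickling a different type of object, the following stores a dictionary that contains a list and then restores it. The dictionary contents and the file path in a temporary directory are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical data and file location used only for this illustration
record = {'topic': 'Python', 'series': 'In Easy Steps', 'chapters': [1, 2, 3]}
path = os.path.join(tempfile.mkdtemp(), 'record.dat')

with open(path, 'wb') as file:       # binary mode for writing
    pickle.dump(record, file)

with open(path, 'rb') as file:       # binary mode for reading
    restored = pickle.load(file)

print(restored == record)            # same contents
print(restored is record)            # but a separate object
print(type(restored['chapters']))
```

The restored object keeps its original data types, so no conversion from strings is needed.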
Summary
• Strings can be manipulated by operators for concatenation +, selecting slices [ ], and membership with in and not in
• The special __doc__ attribute can contain a “docstring” describing a module, function, class, or method
• Python’s built-in dir() function can be useful to examine the names of functions and variables defined in a module
• The __builtins__ module contains functions and variables that are available by default, such as the print() function
• A str object has a format() method for string formatting and many methods for string modification, such as capitalize()
• Unicode character encoding is used by default but a string can be converted to bytes with the str object’s encode() method and converted back with the decode() method
• The unicodedata module provides a name() function that reveals the Unicode name of each character
• The open() function returns a file object that has read(), write(), and close() methods for working with files, and features that describe the file properties
• The open() function must specify a file name string argument and a file mode string argument, such as 'r' to read the file
• Position in a file, at which to read or write, can be specified with the seek() method and reported by the tell() method
• The Python with keyword groups file operational statements within a block and automatically closes an open file
• The process of “pickling” objects stores a serialized representation of an object that can later be “unpickled” to its former state
• The pickle module’s dump() function requires arguments to specify an object for conversion and an open file in which to store data
• Stored object data can be retrieved by specifying the open file in which it is stored to the pickle module’s load() function