Mangle Data Like a Pro - Introducing Python (2014)

Chapter 7. Mangle Data Like a Pro

In this chapter, you’ll learn many techniques for taming data. Most of them concern these built-in Python data types:

strings

Sequences of Unicode characters, used for text data.

bytes and bytearrays

Sequences of eight-bit integers, used for binary data.

Text Strings

Text is the most familiar type of data to most readers, so we’ll begin with some of the powerful features of text strings in Python.

Unicode

All of the text examples in this book thus far have been plain old ASCII. ASCII was defined in the 1960s, when computers were the size of refrigerators and only slightly better at performing computations. The basic unit of computer storage is the byte, which can store 256 unique values in its eight bits. For various reasons, ASCII only used 7 bits (128 unique values): 26 uppercase letters, 26 lowercase letters, 10 digits, some punctuation symbols, some spacing characters, and some nonprinting control codes.

Unfortunately, the world has more letters than ASCII provides. You could have a hot dog at a diner, but never a Gewürztraminer[5] at a café. Many attempts have been made to add more letters and symbols, and you’ll see them at times. Just a couple of those include:

§ Latin-1, or ISO 8859-1

§ Windows code page 1252

Each of these uses all eight bits, but even that’s not enough, especially when you need non-European languages. Unicode is an ongoing international standard to define the characters of all the world’s languages, plus symbols from mathematics and other fields.

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

— The Unicode Consortium

The Unicode Code Charts page has links to all the currently defined character sets with images. The latest version (6.2) defines over 110,000 characters, each with a unique name and identification number. The characters are divided into 17 sets of 65,536 code points each, called planes. The first one (plane 0) is the Basic Multilingual Plane, which holds the characters of most modern languages. See the Wikipedia page about Unicode planes for details.

Python 3 Unicode strings

Python 3 strings are Unicode strings, not byte arrays. This is the single largest change from Python 2, which distinguished between normal byte strings and Unicode character strings.

If you know the Unicode ID or name for a character, you can use it in a Python string. Here are some examples:

§ A \u followed by four hex digits[6] specifies a character in Unicode’s Basic Multilingual Plane (code points 0000 to FFFF). The first 128 code points (0000 to 007F) are good old ASCII, and the character positions within that range are the same as ASCII.

§ For characters in the higher planes, we need more bits. The Python escape sequence for these is \U followed by eight hex characters; the leftmost ones need to be 0.

§ For all characters, \N{ name } lets you specify it by its standard name. The Unicode Character Name Index page lists these.
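As a quick sanity check, here’s a minimal sketch showing that the three escape styles above are just different spellings of the same character. The variable names are made up for this illustration.

```python
# Three ways to write the same character, é (code point U+00E9).
by_short_escape = '\u00e9'       # \u plus four hex digits
by_long_escape = '\U000000e9'    # \U plus eight hex digits
by_name = '\N{LATIN SMALL LETTER E WITH ACUTE}'  # standard Unicode name

# All three literals produce the identical one-character string.
print(by_short_escape == by_long_escape == by_name)

# ord() returns the character's code point as an integer.
print(hex(ord(by_name)))
```

Which spelling you pick is a matter of taste: the name is the most readable, and the hex forms are the most compact.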

The Python unicodedata module has functions that translate in both directions:

§ lookup()—Takes a case-insensitive name and returns a Unicode character

§ name()—Takes a Unicode character and returns an uppercase name

In the following example, we’ll write a test function that takes a Python Unicode character, looks up its name, and looks up the character again from the name (it should match the original character):

>>> def unicode_test(value):

...     import unicodedata

...     name = unicodedata.name(value)

...     value2 = unicodedata.lookup(name)

...     print('value="%s", name="%s", value2="%s"' % (value, name, value2))

...

Let’s try some characters, beginning with a plain ASCII letter:

>>> unicode_test('A')

value="A", name="LATIN CAPITAL LETTER A", value2="A"

ASCII punctuation:

>>> unicode_test('$')

value="$", name="DOLLAR SIGN", value2="$"

A Unicode currency character:

>>> unicode_test('\u00a2')

value="¢", name="CENT SIGN", value2="¢"

Another Unicode currency character:

>>> unicode_test('\u20ac')

value="€", name="EURO SIGN", value2="€"

The only problem you could potentially run into is limitations in the font you’re using to display text. Not all fonts have images for all Unicode characters, and some might display a placeholder character instead. For instance, here’s the Unicode symbol for SNOWMAN, like symbols in dingbat fonts:

>>> unicode_test('\u2603')

value="☃", name="SNOWMAN", value2="☃"

Suppose that we want to save the word café in a Python string. One way is to copy and paste it from a file or website and hope that it works:

>>> place = 'café'

>>> place

'café'

This worked because I copied and pasted from a source that used UTF-8 encoding (which you’ll see in a few pages) for its text.

How can we specify that final é character? If you look at the character index for E, you see that the name E WITH ACUTE, LATIN SMALL LETTER has the value 00E9. Let’s check with the name() and lookup() functions that we were just playing with. First, give the code to get the name:

>>> import unicodedata

>>> unicodedata.name('\u00e9')

'LATIN SMALL LETTER E WITH ACUTE'

Next, give the name to look up the code:

>>> unicodedata.lookup('E WITH ACUTE, LATIN SMALL LETTER')

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

KeyError: "undefined character name 'E WITH ACUTE, LATIN SMALL LETTER'"

NOTE

The names listed on the Unicode Character Name Index page were reformatted to make them sort nicely for display. To convert them to their real Unicode names (the ones that Python uses), remove the comma and move the part of the name that was after the comma to the beginning. Accordingly, change E WITH ACUTE, LATIN SMALL LETTER to LATIN SMALL LETTER E WITH ACUTE:

>>> unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE')

'é'

Now, we can specify the string café by code or by name:

>>> place = 'caf\u00e9'

>>> place

'café'

>>> place = 'caf\N{LATIN SMALL LETTER E WITH ACUTE}'

>>> place

'café'

In the preceding snippet, we inserted the é directly in the string, but we can also build a string by appending:

>>> u_umlaut = '\N{LATIN SMALL LETTER U WITH DIAERESIS}'

>>> u_umlaut

'ü'

>>> drink = 'Gew' + u_umlaut + 'rztraminer'

>>> print('Now I can finally have my', drink, 'in a', place)

Now I can finally have my Gewürztraminer in a café

The string len function counts Unicode characters, not bytes:

>>> len('$')

1

>>> len('\U0001f47b')

1

Encode and decode with UTF-8

You don’t need to worry about how Python stores each Unicode character when you do normal string processing.

However, when you exchange data with the outside world, you need a couple of things:

§ A way to encode character strings to bytes

§ A way to decode bytes to character strings

If there were no more than 65,536 characters in Unicode, we could store each Unicode character ID in two bytes. Unfortunately, there are more. We could encode each ID in three or four bytes, but that would increase the memory and disk storage space needs for common text strings by three or four times.

Ken Thompson and Rob Pike, whose names will be familiar to Unix developers, designed the UTF-8 dynamic encoding scheme one night on a placemat in a New Jersey diner. It uses one to four bytes per Unicode character:

§ One byte for ASCII

§ Two bytes for most Latin-derived languages, as well as Greek and Cyrillic

§ Three bytes for the rest of the basic multilingual plane

§ Four bytes for the rest, including some Asian languages and symbols
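To make the one-to-four-byte rule concrete, here’s a small sketch that encodes one sample character from each size class and counts the bytes. It uses the string encode() method, which is covered in the next section; the sample characters and the byte_counts name are just choices made for this example.

```python
# One sample character per UTF-8 size class.
samples = [
    'A',           # ASCII letter
    '\u00e9',      # é, LATIN SMALL LETTER E WITH ACUTE
    '\u0434',      # д, CYRILLIC SMALL LETTER DE
    '\u2603',      # ☃, SNOWMAN
    '\U0001f47b',  # GHOST, outside the Basic Multilingual Plane
]

byte_counts = []
for ch in samples:
    encoded = ch.encode('utf-8')
    byte_counts.append(len(encoded))
    # Show the code point, the number of bytes, and the bytes themselves.
    print('U+%04X' % ord(ch), len(encoded), encoded)
```

Each string has a len() of 1, but the encoded forms take 1, 2, 2, 3, and 4 bytes, respectively.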

UTF-8 is the standard text encoding in Python, Linux, and HTML. It’s fast, complete, and works well. If you use UTF-8 encoding throughout your code, life will be much easier than trying to hop in and out of various encodings.

NOTE

If you create a Python string by copying and pasting from another source such as a web page, be sure the source is encoded in the UTF-8 format. It’s very common to see text that was encoded as Latin-1 or Windows 1252 copied into a Python string, which causes an exception later with an invalid byte sequence.

Encoding

You encode a string to bytes. The string encode() function’s first argument is the encoding name. The choices include those presented in Table 7-1.

Table 7-1. Encodings

'ascii'             Good old seven-bit ASCII
'utf-8'             Eight-bit variable-length encoding, and what you almost always want to use
'latin-1'           Also known as ISO 8859-1
'cp1252'            A common Windows encoding (also spelled 'windows-1252')
'unicode-escape'    Python Unicode literal format, \uxxxx or \Uxxxxxxxx

You can encode anything as UTF-8. Let’s assign the Unicode string '\u2603' to the name snowman:

>>> snowman = '\u2603'

snowman is a Python Unicode string with a single character, regardless of how many bytes might be needed to store it internally:

>>> len(snowman)

1

Next let’s encode this Unicode character to a sequence of bytes:

>>> ds = snowman.encode('utf-8')

As I mentioned earlier, UTF-8 is a variable-length encoding. In this case, it used three bytes to encode the single snowman Unicode character:

>>> len(ds)

3

>>> ds

b'\xe2\x98\x83'

Now, len() returns the number of bytes (3) because ds is a bytes variable.

You can use encodings other than UTF-8, but you’ll get errors if the Unicode string can’t be handled by the encoding. For example, if you use the ascii encoding, it will fail unless your Unicode characters happen to be valid ASCII characters as well:

>>> ds = snowman.encode('ascii')

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

UnicodeEncodeError: 'ascii' codec can't encode character '\u2603'

in position 0: ordinal not in range(128)

The encode() function takes a second argument to help you avoid encoding exceptions. Its default value, which you can see in the previous example, is 'strict'; it raises a UnicodeEncodeError if it sees a character that the encoding can’t represent. There are other error handlers. Use 'ignore' to throw away anything that won’t encode:

>>> snowman.encode('ascii', 'ignore')

b''

Use 'replace' to substitute ? for unknown characters:

>>> snowman.encode('ascii', 'replace')

b'?'

Use 'backslashreplace' to produce a Python Unicode character string, like unicode-escape:

>>> snowman.encode('ascii', 'backslashreplace')

b'\\u2603'

You would use this if you needed a printable version of the Unicode escape sequence.

The following produces character entity strings that you can use in web pages:

>>> snowman.encode('ascii', 'xmlcharrefreplace')

b'&#9731;'
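The handlers are easier to compare side by side on a string that mixes ASCII and non-ASCII characters. The following is a minimal sketch; the text and results names are invented for the example.

```python
# 'café ☃': four ASCII characters plus two that ASCII cannot represent.
text = 'caf\u00e9 \u2603'

results = {}
for handler in ('ignore', 'replace', 'backslashreplace', 'xmlcharrefreplace'):
    # The second argument to encode() selects the error handler.
    results[handler] = text.encode('ascii', handler)
    print('%-18s %r' % (handler, results[handler]))
```

Only the two non-ASCII characters are affected; the letters c, a, f, and the space pass through untouched in every case.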

Decoding

We decode byte strings to Unicode strings. Whenever we get text from some external source (files, databases, websites, network APIs, and so on), it’s encoded as byte strings. The tricky part is knowing which encoding was actually used, so we can run it backward and get Unicode strings.

The problem is that nothing in the byte string itself says what encoding was used. I mentioned the perils of copying and pasting from websites earlier. You’ve probably visited websites with odd characters where plain old ASCII characters should be.

Let’s create a Unicode string called place with the value 'café':

>>> place = 'caf\u00e9'

>>> place

'café'

>>> type(place)

<class 'str'>

Encode it in UTF-8 format in a bytes variable called place_bytes:

>>> place_bytes = place.encode('utf-8')

>>> place_bytes

b'caf\xc3\xa9'

>>> type(place_bytes)

<class 'bytes'>

Notice that place_bytes has five bytes. The first three are the same as ASCII (a strength of UTF-8), and the final two encode the 'é'. Now, let’s decode that byte string back to a Unicode string:

>>> place2 = place_bytes.decode('utf-8')

>>> place2

'café'

This worked because we encoded to UTF-8 and decoded from UTF-8. What if we told it to decode from some other encoding?

>>> place3 = place_bytes.decode('ascii')

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3:

ordinal not in range(128)

The ASCII decoder threw an exception because the byte value 0xc3 is illegal in ASCII. There are some 8-bit character set encodings in which values between 128 (hex 80) and 255 (hex FF) are legal but not the same as UTF-8:

>>> place4 = place_bytes.decode('latin-1')

>>> place4

'cafÃ©'

>>> place5 = place_bytes.decode('windows-1252')

>>> place5

'cafÃ©'

Urk.

The moral of this story: whenever possible, use UTF-8 encoding. It works, is supported everywhere, can express every Unicode character, and is quickly decoded and encoded.
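If you do end up with text that was decoded with the wrong 8-bit encoding, you can sometimes undo the damage by reversing the steps. Here’s a sketch under the assumption that UTF-8 bytes were mistakenly decoded as Latin-1; it also shows that decode(), like encode(), accepts an error handler as its second argument. The variable names are made up for this example.

```python
place_bytes = 'caf\u00e9'.encode('utf-8')   # b'caf\xc3\xa9'

# The mistake: decoding UTF-8 bytes as Latin-1 turns é into two characters.
garbled = place_bytes.decode('latin-1')

# The repair: re-encode with the wrong codec to recover the original
# bytes, then decode them with the right one.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repr(garbled), '->', repr(repaired))

# decode() also takes an error handler. With 'replace', each undecodable
# byte becomes U+FFFD, the Unicode replacement character.
lenient = place_bytes.decode('ascii', 'replace')
print(repr(lenient))
```

The repair works only because Latin-1 maps every byte value to a character, so no information was lost along the way. With other encodings, the round trip might not be possible.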

For more information

If you would like to learn more, these links are particularly helpful:

§ Unicode HOWTO

§ Pragmatic Unicode

§ The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Format

We’ve pretty much ignored text formatting—until now. Chapter 2 shows a few string alignment functions, and the code examples have used simple print() statements, or just let the interactive interpreter display values. But it’s time we look at how to interpolate data values into strings—in other words, put the values inside the strings—using various formats. You can use this to produce reports and other outputs for which appearances need to be just so.

Python has two ways of formatting strings, loosely called old style and new style. Both styles are supported in Python 2 and 3 (new style in Python 2.6 and up). Old style is simpler, so we’ll begin there.

Old style with %

The old style of string formatting has the form string % data. Inside the string are interpolation sequences. Table 7-2 illustrates that the very simplest sequence is a % followed by a letter indicating the data type to be formatted.

Table 7-2. Conversion types

%s    string
%d    decimal integer
%x    hex integer
%o    octal integer
%f    decimal float
%e    exponential float
%g    decimal or exponential float
%%    a literal %

Following are some simple examples. First, an integer:

>>> '%s' % 42

'42'

>>> '%d' % 42

'42'

>>> '%x' % 42

'2a'

>>> '%o' % 42

'52'

A float:

>>> '%s' % 7.03

'7.03'

>>> '%f' % 7.03

'7.030000'

>>> '%e' % 7.03

'7.030000e+00'

>>> '%g' % 7.03

'7.03'

An integer and a literal %:

>>> '%d%%' % 100

'100%'

Some string and integer interpolation:

>>> actor = 'Richard Gere'

>>> cat = 'Chester'

>>> weight = 28

>>> "My wife's favorite actor is %s" % actor

"My wife's favorite actor is Richard Gere"

>>> "Our cat %s weighs %s pounds" % (cat, weight)

'Our cat Chester weighs 28 pounds'

That %s inside the string means to interpolate a string. The number of % appearances in the string needs to match the number of data items after the %. A single data item such as actor goes right after the %. Multiple data items must be grouped into a tuple (bounded by parentheses, separated by commas) such as (cat, weight).

Even though weight is an integer, the %s inside the string converted it to a string.
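Two related details are worth a quick sketch. Old-style formatting also accepts a dictionary on the right side, with %(key)s picking values by name, and a tuple that you want to print as a single item needs to be wrapped in another one-element tuple. The pet and pair names here are invented for illustration.

```python
# A dictionary on the right side: %(key)s looks up each value by name.
pet = {'cat': 'Chester', 'weight': 28}
by_name = 'Our cat %(cat)s weighs %(weight)d pounds' % pet
print(by_name)

# A tuple on the right side is normally unpacked into separate items.
# To format the tuple itself, wrap it in a one-element tuple.
pair = ('Chester', 28)
as_item = 'Our pet record: %s' % (pair,)
print(as_item)
```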

You can add other values between the % and the type specifier to designate minimum and maximum widths, alignment, and character filling:

For variables, let’s define an integer, n; a float, f; and a string, s:

>>> n = 42

>>> f = 7.03

>>> s = 'string cheese'

Format them using default widths:

>>> '%d %f %s' % (n, f, s)

'42 7.030000 string cheese'

Set a minimum field width of 10 characters for each variable, and align them to the right, filling unused spots on the left with spaces:

>>> '%10d %10f %10s' % (n, f, s)

'        42   7.030000 string cheese'

Use the same field width, but align to the left:

>>> '%-10d %-10f %-10s' % (n, f, s)

'42         7.030000   string cheese'

This time, the same field width, but a maximum character width of 4, and aligned to the right. This setting truncates the string, and limits the float to 4 digits after the decimal point:

>>> '%10.4d %10.4f %10.4s' % (n, f, s)

'      0042     7.0300       stri'

The same song as before, but without a minimum field width:

>>> '%.4d %.4f %.4s' % (n, f, s)

'0042 7.0300 stri'

Finally, get the field widths from arguments rather than hard-coding them:

>>> '%*.*d %*.*f %*.*s' % (10, 4, n, 10, 4, f, 10, 4, s)

'      0042     7.0300       stri'

New style formatting with {} and format

Old style formatting is still supported. In Python 2, which will freeze at version 2.7, it will be supported forever. However, new style formatting is recommended if you’re using Python 3.

The simplest usage is demonstrated here:

>>> '{} {} {}'.format(n, f, s)

'42 7.03 string cheese'

Old-style arguments needed to be provided in the order in which their % placeholders appeared in the string. With new-style, you can specify the order:

>>> '{2} {0} {1}'.format(f, s, n)

'42 7.03 string cheese'

The value 0 referred to the first argument, f, whereas 1 referred to the string s, and 2 referred to the last argument, the integer n.

The arguments can be a dictionary or named arguments, and the specifiers can include their names:

>>> '{n} {f} {s}'.format(n=42, f=7.03, s='string cheese')

'42 7.03 string cheese'

In this next example, let’s try combining our three values into a dictionary, which looks like this:

>>> d = {'n': 42, 'f': 7.03, 's': 'string cheese'}

In the following example, {0} is the entire dictionary, whereas {1} is the string 'other' that follows the dictionary:

>>> '{0[n]} {0[f]} {0[s]} {1}'.format(d, 'other')

'42 7.03 string cheese other'
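If the dictionary’s keys are the names you want in the format string, you can skip the {0[n]} indexing. The sketch below shows two standard ways to do this: unpacking the dictionary into named arguments with **, and the string method format_map().

```python
d = {'n': 42, 'f': 7.03, 's': 'string cheese'}

# ** unpacks the dictionary into the named arguments n=..., f=..., s=...
unpacked = '{n} {f} {s}'.format(**d)
print(unpacked)

# format_map() takes the dictionary itself as its only argument.
mapped = '{n} {f} {s}'.format_map(d)
print(mapped)
```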

These examples all printed their arguments with default formats. Old-style allows a type specifier after the % in the string, but new-style puts it after a :. First, with positional arguments:

>>> '{0:d} {1:f} {2:s}'.format(n, f, s)

'42 7.030000 string cheese'

In this example, we’ll use the same values, but as named arguments:

>>> '{n:d} {f:f} {s:s}'.format(n=42, f=7.03, s='string cheese')

'42 7.030000 string cheese'

The other options (minimum field width, maximum character width, alignment, and so on) are also supported.

Minimum field width 10, right-aligned (default):

>>> '{0:10d} {1:10f} {2:10s}'.format(n, f, s)

'        42   7.030000 string cheese'

Same as the preceding example, but the > characters make the right-alignment more explicit:

>>> '{0:>10d} {1:>10f} {2:>10s}'.format(n, f, s)

'        42   7.030000 string cheese'

Minimum field width 10, left-aligned:

>>> '{0:<10d} {1:<10f} {2:<10s}'.format(n, f, s)

'42         7.030000   string cheese'

Minimum field width 10, centered:

>>> '{0:^10d} {1:^10f} {2:^10s}'.format(n, f, s)

'    42      7.030000  string cheese'

There is one change from old-style: the precision value (after the decimal point) still means the number of digits after the decimal for floats, and the maximum number of characters for strings, but you can’t use it for integers:

>>> '{0:>10.4d} {1:>10.4f} {2:10.4s}'.format(n, f, s)

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

ValueError: Precision not allowed in integer format specifier

>>> '{0:>10d} {1:>10.4f} {2:>10.4s}'.format(n, f, s)

'        42     7.0300       stri'

The final option is the fill character. If you want something other than spaces to pad your output fields, put it right after the :, before any alignment (<, >, ^) or width specifiers:

>>> '{0:!^20s}'.format('BIG SALE')

'!!!!!!BIG SALE!!!!!!'
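Putting width, alignment, precision, and fill together, here’s a sketch of the kind of small report mentioned at the start of this section. The rows data and the column widths are invented for the example.

```python
# Hypothetical report data: (item, quantity, unit price).
rows = [('spam', 3, 1.5), ('eggs', 12, 0.25)]

# Header: left-align the first column, right-align the other two.
lines = ['{:<10}{:>5}{:>10}'.format('item', 'qty', 'price')]

# A ruler made by padding an empty string to 25 characters with '-'.
lines.append('{:-<25}'.format(''))

for name, qty, price in rows:
    # 10-wide string, 5-wide integer, 10-wide float with 2 decimals.
    lines.append('{:<10s}{:>5d}{:>10.2f}'.format(name, qty, price))

print('\n'.join(lines))
```

Because every field has a fixed minimum width, the columns line up as long as no value is wider than its field.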

Match with Regular Expressions

Chapter 2 touched on simple string operations. Armed with that introductory information, you’ve probably used simple “wildcard” patterns on the command line, such as ls *.py, which means list all filenames ending in .py.

It’s time to explore more complex pattern matching by using regular expressions. These are provided in the standard module re, which we’ll import. You define a string pattern that you want to match, and the source string to match against. For simple matches, usage looks like this:

result = re.match('You', 'Young Frankenstein')

Here, 'You' is the pattern and 'Young Frankenstein' is the source—the string you want to check. match() checks whether the source begins with the pattern.

For more complex matches, you can compile your pattern first to speed up the match later:

youpattern = re.compile('You')

Then, you can perform your match against the compiled pattern:

result = youpattern.match('Young Frankenstein')
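Compiling pays off mostly when you reuse the same pattern many times. Here’s a minimal sketch that checks a list of strings against one compiled pattern; the titles list is made up for the example.

```python
import re

# Compile once, then reuse the pattern object for every string.
youpattern = re.compile('You')

titles = ['Young Frankenstein', 'The Producers', 'You Only Live Twice']

# match() returns a match object (truthy) or None (falsy).
starts_with_you = [title for title in titles if youpattern.match(title)]
print(starts_with_you)
```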

match() is not the only way to compare the pattern and source. Here are several other methods you can use:

§ search() returns the first match, if any.

§ findall() returns a list of all non-overlapping matches, if any.

§ split() splits source at matches with pattern and returns a list of the string pieces.

§ sub() takes another replacement argument, and changes all parts of source that are matched by pattern to replacement.

Exact match with match()

Does the string 'Young Frankenstein' begin with the word 'You'? Here’s some code with comments:

>>> import re

>>> source = 'Young Frankenstein'

>>> m = re.match('You', source) # match starts at the beginning of source

>>> if m: # match returns an object; do this to see what matched

...     print(m.group())

...

You

>>> m = re.match('^You', source) # start anchor does the same

>>> if m:

...     print(m.group())

...

You

How about 'Frank'?

>>> m = re.match('Frank', source)

>>> if m:

...     print(m.group())

...

This time match() returned None, so the if did not run the print() call. As I said earlier, match() works only if the pattern is at the beginning of the source. But search() works if the pattern is anywhere:

>>> m = re.search('Frank', source)

>>> if m:

...     print(m.group())

...

Frank

Let’s change the pattern:

>>> m = re.match('.*Frank', source)

>>> if m: # match returns an object

...     print(m.group())

...

Young Frank

Following is a brief explanation of how our new pattern works:

§ . means any single character.

§ * means any number of the preceding thing. Together, .* means any number of characters (even zero).

§ Frank is the phrase that we wanted to match, somewhere.

match() returned the string that matched .*Frank: 'Young Frank'.

First match with search()

You can use search() to find the pattern 'Frank' anywhere in the source string 'Young Frankenstein', without the need for the .* wildcards:

>>> m = re.search('Frank', source)

>>> if m: # search returns an object

...     print(m.group())

...

Frank
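The match object knows where the match was found, not just what it matched. This short sketch uses the match object’s start(), end(), and span() methods to locate 'Frank' within the source.

```python
import re

source = 'Young Frankenstein'
m = re.search('Frank', source)

if m:
    # start() and end() are offsets into source; span() returns both.
    print(m.start(), m.end(), m.span())
    # Slicing the source with those offsets gives back the matched text.
    print(source[m.start():m.end()])
```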

All matches with findall()

The preceding examples looked for one match only. But what if you want to know how many instances of the single-letter string 'n' are in the string?

>>> m = re.findall('n', source)

>>> m # findall returns a list

['n', 'n', 'n', 'n']

>>> print('Found', len(m), 'matches')

Found 4 matches

How about 'n' followed by any character?

>>> m = re.findall('n.', source)

>>> m

['ng', 'nk', 'ns']

Notice that it did not match that final 'n'. We need to say that the character after 'n' is optional with ?:

>>> m = re.findall('n.?', source)

>>> m

['ng', 'nk', 'ns', 'n']

Split at matches with split()

The example that follows shows you how to split a string into a list by a pattern rather than a simple string (as the normal string split() method would do):

>>> m = re.split('n', source)

>>> m # split returns a list

['You', 'g Fra', 'ke', 'stei', '']

Replace at matches with sub()

This is like the string replace() method, but for patterns rather than literal strings:

>>> m = re.sub('n', '?', source)

>>> m # sub returns a string

'You?g Fra?ke?stei?'
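Both split() and sub() accept optional arguments that limit how much work they do, and the replacement given to sub() can be a function instead of a string. The following sketch shows these variations on the same source string.

```python
import re

source = 'Young Frankenstein'

# maxsplit limits the number of splits; the rest of the string is kept whole.
first_split = re.split('n', source, maxsplit=1)
print(first_split)

# count limits the number of replacements.
two_subs = re.sub('n', '?', source, count=2)
print(two_subs)

# A function replacement receives each match object and returns the new text.
shouted = re.sub('n', lambda m: m.group().upper(), source)
print(shouted)
```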

Patterns: special characters

Many descriptions of regular expressions start with all the details of how to define them. I think that’s a mistake. Regular expressions are a not-so-little language in their own right, with too many details to fit in your head at once. They use so much punctuation that they look like cartoon characters swearing.

With these functions (match(), search(), findall(), split(), and sub()) under your belt, let’s get into the details of building patterns. The patterns you make apply to any of these functions.

You’ve seen the basics:

§ Literal matches with any non-special characters

§ Any single character except \n with .

§ Any number (including zero) with *

§ Optional (zero or one) with ?

First, special characters are shown in Table 7-3:

Table 7-3. Special characters

Pattern    Matches
\d         a single digit
\D         a single non-digit
\w         an alphanumeric character
\W         a non-alphanumeric character
\s         a whitespace character
\S         a non-whitespace character
\b         a word boundary (between a \w and a \W, in either order)
\B         a non-word boundary

The Python string module has predefined string constants that we can use for testing. We’ll use printable, which contains 100 printable ASCII characters, including letters in both cases, digits, space characters, and punctuation:

>>> import string

>>> printable = string.printable

>>> len(printable)

100

>>> printable[0:50]

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMN'

>>> printable[50:]

'OPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Which characters in printable are digits?

>>> re.findall('\d', printable)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

Which characters are digits, letters, or an underscore?

>>> re.findall('\w', printable)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b',

'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',

'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',

'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',

'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X',

'Y', 'Z', '_']

Which are spaces?

>>> re.findall('\s', printable)

[' ', '\t', '\n', '\r', '\x0b', '\x0c']

Regular expressions are not confined to ASCII. A \d will match whatever Unicode calls a digit, not just ASCII characters '0' through '9'. Let’s add two non-ASCII lowercase letters from FileFormat.info.

In this test, we’ll throw in the following:

§ Three ASCII letters

§ Three punctuation symbols that should not match a \w

§ A Unicode LATIN SMALL LETTER E WITH CIRCUMFLEX (\u00ea)

§ A Unicode LATIN SMALL LETTER E WITH BREVE (\u0115)

>>> x = 'abc' + '-/*' + '\u00ea' + '\u0115'

As expected, this pattern found only the letters:

>>> re.findall('\w', x)

['a', 'b', 'c', 'ê', 'ĕ']
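If you want the special characters to stick to ASCII after all, the re module provides the re.ASCII flag. Here’s a sketch comparing the two behaviors; the digits string is invented for the example, and the r prefix on each pattern (raw strings come up again later in this chapter) keeps Python from interpreting the backslash itself.

```python
import re

x = 'abc' + '-/*' + '\u00ea' + '\u0115'

# Default for str patterns: \w matches Unicode word characters.
unicode_words = re.findall(r'\w', x)
# With re.ASCII, \w matches only [a-zA-Z0-9_].
ascii_words = re.findall(r'\w', x, re.ASCII)
print(unicode_words, ascii_words)

# The same applies to \d: U+0663 is ARABIC-INDIC DIGIT THREE.
digits = '3\u0663'
unicode_digits = re.findall(r'\d', digits)
ascii_digits = re.findall(r'\d', digits, re.ASCII)
print(unicode_digits, ascii_digits)
```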

Patterns: using specifiers

Now, let’s make “punctuation pizza,” using the main pattern specifiers for regular expressions, which are presented in Table 7-4.

In the table, expr and the other italicized words mean any valid regular expression.

Table 7-4. Pattern specifiers

Pattern          Matches
abc              literal abc
(expr)           expr
expr1|expr2      expr1 or expr2
.                any character except \n
^                start of source string
$                end of source string
prev?            zero or one prev
prev*            zero or more prev, as many as possible
prev*?           zero or more prev, as few as possible
prev+            one or more prev, as many as possible
prev+?           one or more prev, as few as possible
prev{m}          m consecutive prev
prev{m,n}        m to n consecutive prev, as many as possible
prev{m,n}?       m to n consecutive prev, as few as possible
[abc]            a or b or c (same as a|b|c)
[^abc]           not (a or b or c)
prev(?=next)     prev if followed by next
prev(?!next)     prev if not followed by next
(?<=prev)next    next if preceded by prev
(?<!prev)next    next if not preceded by prev

Your eyes might cross permanently when trying to read these examples. First, let’s define our source string:

>>> source = '''I wish I may, I wish I might

... Have a dish of fish tonight.'''

First, find wish anywhere:

>>> re.findall('wish', source)

['wish', 'wish']

Next, find wish or fish anywhere:

>>> re.findall('wish|fish', source)

['wish', 'wish', 'fish']

Find wish at the beginning:

>>> re.findall('^wish', source)

[]

Find I wish at the beginning:

>>> re.findall('^I wish', source)

['I wish']

Find fish at the end:

>>> re.findall('fish$', source)

[]

Finally, find fish tonight. at the end:

>>> re.findall('fish tonight.$', source)

['fish tonight.']

The characters ^ and $ are called anchors: ^ anchors the search to the beginning of the search string, and $ anchors it to the end. .$ matches any character at the end of the line, including a period, so that worked. To be more precise, we should escape the dot to match it literally:

>>> re.findall('fish tonight\.$', source)

['fish tonight.']

Begin by finding w or f followed by ish:

>>> re.findall('[wf]ish', source)

['wish', 'wish', 'fish']

Find one or more runs of w, s, or h:

>>> re.findall('[wsh]+', source)

['w', 'sh', 'w', 'sh', 'h', 'sh', 'sh', 'h']

Find ght followed by a non-alphanumeric:

>>> re.findall('ght\W', source)

['ght\n', 'ght.']

Find I followed by wish:

>>> re.findall('I (?=wish)', source)

['I ', 'I ']

And last, wish preceded by I:

>>> re.findall('(?<=I) wish', source)

[' wish', ' wish']
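The “as many as possible” and “as few as possible” entries in Table 7-4 are easier to see in action. Here’s a sketch using the same source string that contrasts a greedy .* with a non-greedy .*?, and uses a {m,} repetition count to pick out runs of five or more lowercase letters.

```python
import re

source = '''I wish I may, I wish I might
Have a dish of fish tonight.'''

# Greedy: .* grabs as much of the line as it can before the last 'wish'.
greedy = re.findall('I.*wish', source)
print(greedy)

# Non-greedy: .*? stops at the first 'wish' it can reach.
lazy = re.findall('I.*?wish', source)
print(lazy)

# {5,} means five or more of the preceding character class.
long_words = re.findall('[a-z]{5,}', source)
print(long_words)
```

Neither version crosses into the second line, because . does not match \n.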

There are a few cases in which the regular expression pattern rules conflict with the Python string rules. The following pattern should match any word that begins with fish:

>>> re.findall('\bfish', source)

[]

Why doesn’t it? As is discussed in Chapter 2, Python employs a few special escape characters for strings. For example, \b means backspace in strings, but in the mini-language of regular expressions it means a word boundary (here, the beginning of a word). Avoid the accidental use of escape characters by using Python’s raw strings when you define your regular expression string. Always put an r character before your regular expression pattern string, and Python escape characters will be disabled, as demonstrated here:

>>> re.findall(r'\bfish', source)

['fish']
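You can see what went wrong by inspecting the two pattern strings directly. This sketch compares the plain and raw versions; the variable names are just for illustration.

```python
plain = '\bfish'    # Python turns \b into one backspace character
raw = r'\bfish'     # the raw string keeps the backslash and the b

print(len(plain), len(raw))

# The first character of the plain string is a backspace (hex 08).
print(plain[0] == '\x08')

# Doubling the backslash is the other way to get the same pattern.
print(raw == '\\bfish')
```

The regular expression engine never saw a \b in the plain version; it was asked to find a literal backspace followed by fish.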

Patterns: specifying match output

When using match() or search(), the entire match is available from the result object m as m.group(). If you enclose part of a pattern in parentheses, whatever it matches will be saved to its own group, and a tuple of the groups will be available as m.groups(), as shown here:

>>> m = re.search(r'(. dish\b).*(\bfish)', source)

>>> m.group()

'a dish of fish'

>>> m.groups()

('a dish', 'fish')

If you use the pattern (?P<name>expr), it will match expr, saving the match in group name:

>>> m = re.search(r'(?P<DISH>. dish\b).*(?P<FISH>\bfish)', source)

>>> m.group()

'a dish of fish'

>>> m.groups()

('a dish', 'fish')

>>> m.group('DISH')

'a dish'

>>> m.group('FISH')

'fish'
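Named groups can also be collected all at once. Here’s a short sketch using the match object’s groupdict() method, which returns a dictionary keyed by group name.

```python
import re

source = '''I wish I may, I wish I might
Have a dish of fish tonight.'''

m = re.search(r'(?P<DISH>. dish\b).*(?P<FISH>\bfish)', source)

# groupdict() maps each group name to the text that group matched.
named = m.groupdict()
print(named)

# Numbered access still works: group 1 is DISH and group 2 is FISH.
print(m.group(1), '|', m.group(2))
```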

Binary Data

Text data can be challenging, but binary data can be, well, interesting. You need to know about concepts such as endianness (how your computer’s processor breaks data into bytes) and sign bits for integers. You might need to delve into binary file formats or network packets to extract or even change data. This section will show you the basics of binary data wrangling in Python.
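To get a feel for endianness before going further, here’s a minimal sketch using the built-in int methods to_bytes() and from_bytes(). The value 258 is chosen only because its two bytes (hex 01 and 02) are easy to tell apart.

```python
number = 258  # hex 0102

# Big-endian stores the most significant byte first.
big = number.to_bytes(2, 'big')
# Little-endian stores the least significant byte first.
little = number.to_bytes(2, 'little')
print(big, little)

# Reading bytes back with the wrong byte order gives a different number.
right = int.from_bytes(big, 'big')
wrong = int.from_bytes(big, 'little')
print(right, wrong)
```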

bytes and bytearray

Python 3 introduced the following sequences of eight-bit integers, with possible values from 0 to 255, in two types:

§ bytes is immutable, like a tuple of bytes

§ bytearray is mutable, like a list of bytes

Beginning with a list called blist, this next example creates a bytes variable called the_bytes and a bytearray variable called the_byte_array:

>>> blist = [1, 2, 3, 255]

>>> the_bytes = bytes(blist)

>>> the_bytes

b'\x01\x02\x03\xff'

>>> the_byte_array = bytearray(blist)

>>> the_byte_array

bytearray(b'\x01\x02\x03\xff')

NOTE

The representation of a bytes value begins with a b and a quote character, followed by hex sequences such as \x02 or ASCII characters, and ends with a matching quote character. Python converts the hex sequences or ASCII characters to little integers, but shows byte values that are also valid ASCII encodings as ASCII characters.

>>> b'\x61'

b'a'

>>> b'\x01abc\xff'

b'\x01abc\xff'

This next example demonstrates that you can’t change a bytes variable:

>>> the_bytes[1] = 127

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

TypeError: 'bytes' object does not support item assignment

But a bytearray variable is mellow and mutable:

>>> the_byte_array = bytearray(blist)

>>> the_byte_array

bytearray(b'\x01\x02\x03\xff')

>>> the_byte_array[1] = 127

>>> the_byte_array

bytearray(b'\x01\x7f\x03\xff')
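Because a bytearray behaves like a list of small integers, the usual list-style operations work on it too. The following sketch shows a few of them, along with what happens when a value falls outside the range 0 to 255; the frozen and error_message names are made up for the example.

```python
the_byte_array = bytearray([1, 2, 3, 255])

the_byte_array.append(0)              # add one byte at the end
the_byte_array.extend(b'\x10\x20')    # add several bytes at once
the_byte_array[0:2] = b'AB'           # replace a slice in place

# bytes() makes an immutable copy of the current contents.
frozen = bytes(the_byte_array)
print(frozen)

# Every element must fit in eight bits.
error_message = None
try:
    bytearray([256])
except ValueError as err:
    error_message = str(err)
print(error_message)
```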

Each of these would create a 256-element result, with values from 0 to 255:

>>> the_bytes = bytes(range(0, 256))

>>> the_byte_array = bytearray(range(0, 256))

When printing bytes or bytearray data, Python uses \xxx (where xx is two hex digits) for non-printable bytes and their ASCII equivalents for printable ones (plus some common escape characters, such as \n instead of \x0a). Here’s the printed representation of the_bytes (manually reformatted to show 16 bytes per line):

>>> the_bytes

b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f

\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f

!"#$%&\'()*+,-./

0123456789:;<=>?

@ABCDEFGHIJKLMNO

PQRSTUVWXYZ[\\]^_

`abcdefghijklmno

pqrstuvwxyz{|}~\x7f

\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f

\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f

\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf

\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf

\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf

\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf

\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef

\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

This can be confusing, because they’re bytes (teeny integers), not characters.
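To make the distinction concrete, here is a minimal sketch (the variable name sample is only illustrative). Indexing a bytes value gives back an integer, whereas slicing gives back another bytes value:

```python
sample = b'abc\xff'

# Indexing returns an integer in the range 0-255, not a one-character string.
first = sample[0]
print(first)            # 97

# Slicing returns a new bytes object, even when it holds a single element.
first_slice = sample[0:1]
print(first_slice)      # b'a'

# Iterating yields integers, so list() exposes the underlying values.
values = list(sample)
print(values)           # [97, 98, 99, 255]
```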

Convert Binary Data with struct

As you’ve seen, Python has many tools for manipulating text. Tools for binary data are much less prevalent. The standard library contains the struct module, which handles data similar to structs in C and C++. Using struct, you can convert binary data to and from Python data structures.

Let’s see how this works with data from a PNG file—a common image format that you’ll see along with GIF and JPEG files. We’ll write a small program that extracts the width and height of an image from some PNG data.

We’ll use the O’Reilly logo—the little bug-eyed tarsier shown in Figure 7-1.

Figure 7-1. The O’Reilly tarsier

The PNG file for this image is available on Wikipedia. We don’t show how to read files until Chapter 8, so I downloaded this file, wrote a little program to print its values as bytes, and just typed the values of the first 30 bytes into a Python bytes variable called data for the example that follows. (The PNG format specification stipulates that the width and height are stored within the first 24 bytes, so we don’t need more than that for now.)

>>> import struct

>>> valid_png_header = b'\x89PNG\r\n\x1a\n'

>>> data = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR' + \

... b'\x00\x00\x00\x9a\x00\x00\x00\x8d\x08\x02\x00\x00\x00\xc0'

>>> if data[:8] == valid_png_header:

...     width, height = struct.unpack('>LL', data[16:24])

...     print('Valid PNG, width', width, 'height', height)

... else:

...     print('Not a valid PNG')

...

Valid PNG, width 154 height 141

Here’s what this code does:

§ data contains the first 30 bytes from the PNG file. To fit on the page, I joined two byte strings with + and the continuation character (\).

§ valid_png_header contains the 8-byte sequence that marks the start of a valid PNG file.

§ width is extracted from bytes 16 through 19, and height from bytes 20 through 23 (counting from 0), which together make up the slice data[16:24].

The >LL is the format string that instructs unpack() how to interpret its input byte sequences and assemble them into Python data types. Here’s the breakdown:

§ The > means that integers are stored in big-endian format.

§ Each L specifies a 4-byte unsigned long integer.

You can examine each 4-byte value directly:

>>> data[16:20]

b'\x00\x00\x00\x9a'

>>> data[20:24]

b'\x00\x00\x00\x8d'

Big-endian integers have the most significant bytes to the left. Because the width and height are each less than 256, they fit into the last byte of each sequence. You can verify that these hex values match the expected decimal values:

>>> 0x9a

154

>>> 0x8d

141
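As a rough illustration of why byte order matters, the built-in int.from_bytes() method can interpret the same four bytes both ways. This sketch is not part of the PNG example itself; the variable names are just for illustration:

```python
raw = b'\x00\x00\x00\x9a'

# Most significant byte first: the same interpretation as '>' in struct.
big = int.from_bytes(raw, byteorder='big')
print(big)           # 154

# Least significant byte first: the 0x9a byte is now the highest-order byte.
little = int.from_bytes(raw, byteorder='little')
print(hex(little))   # 0x9a000000
```

Reading big-endian data as little-endian does not raise an error; it just silently produces a very different number.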

When you want to go in the other direction and convert Python data to bytes, use the struct pack() function:

>>> import struct

>>> struct.pack('>L', 154)

b'\x00\x00\x00\x9a'

>>> struct.pack('>L', 141)

b'\x00\x00\x00\x8d'
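pack() also accepts several values at once. The following sketch, which assumes the same 30-byte data value as before, rebuilds bytes 8 through 23 of the PNG data (chunk length, chunk type, width, and height) and compares the result with the original. The 4s specifier for a 4-byte bytes value is listed in Table 7-6:

```python
import struct

data = (b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR'
        b'\x00\x00\x00\x9a\x00\x00\x00\x8d\x08\x02\x00\x00\x00\xc0')

# Chunk length (13), chunk type (4 bytes), width, and height, all big-endian.
packed = struct.pack('>L4sLL', 13, b'IHDR', 154, 141)
print(packed == data[8:24])              # True

# unpack() with the same format string recovers the original values.
fields = struct.unpack('>L4sLL', packed)
print(fields)                            # (13, b'IHDR', 154, 141)
```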

Table 7-5 and Table 7-6 show the format specifiers for pack() and unpack().

The endian specifiers go first in the format string.

Table 7-5. Endian specifiers

Specifier   Byte order
<           little endian
>           big endian

Table 7-6. Format specifiers

Specifier   Description                  Bytes
x           skip a byte                  1
b           signed byte                  1
B           unsigned byte                1
h           signed short integer         2
H           unsigned short integer       2
i           signed integer               4
I           unsigned integer             4
l           signed long integer          4
L           unsigned long integer        4
Q           unsigned long long integer   8
f           single precision float       4
d           double precision float       8
p           count and characters         1 + count
s           characters                   count

The type specifiers follow the endian character. Any specifier may be preceded by a number that indicates the count; 5B is the same as BBBBB.
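One way to check how many bytes a format string describes is struct.calcsize(). The sketch below uses it to confirm that the two spellings are equivalent, and it also shows one wrinkle: for the s specifier, the number is the length of a single bytes value rather than a repeat count.

```python
import struct

# A count prefix repeats the specifier: 5B and BBBBB describe the same layout.
assert struct.calcsize('>5B') == struct.calcsize('>BBBBB') == 5

packed = struct.pack('>5B', 1, 2, 3, 4, 5)
print(packed)                            # b'\x01\x02\x03\x04\x05'

unpacked = struct.unpack('>BBBBB', packed)
print(unpacked)                          # (1, 2, 3, 4, 5)

# For s, the count is the length of one bytes value, so one item comes back.
chunk_type = struct.unpack('>4s', b'IHDR')
print(chunk_type)                        # (b'IHDR',)
```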

You can use a count prefix instead of >LL:

>>> struct.unpack('>2L', data[16:24])

(154, 141)

We used the slice data[16:24] to grab the interesting bytes directly. We could also use the x specifier to skip the uninteresting parts:

>>> struct.unpack('>16x2L6x', data)

(154, 141)

This means:

§ Use big-endian integer format (>)

§ Skip 16 bytes (16x)

§ Read 8 bytes—two unsigned long integers (2L)

§ Skip the final 6 bytes (6x)
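A third option, sketched here as an alternative rather than as the approach used above, is struct.unpack_from(), which takes a starting offset and ignores any bytes that follow the requested fields:

```python
import struct

data = (b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR'
        b'\x00\x00\x00\x9a\x00\x00\x00\x8d\x08\x02\x00\x00\x00\xc0')

# Start reading at byte offset 16; bytes after the two integers are ignored.
width, height = struct.unpack_from('>2L', data, 16)
print(width, height)   # 154 141
```

This avoids both the slice and the need to account for every trailing byte in the format string.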

Other Binary Data Tools

Some third-party open source packages offer the following, more declarative ways of defining and extracting binary data:

§ bitstring

§ construct

§ hachoir

§ binio

Appendix D has details on how to download and install external packages such as these. For the next example, you need to install construct. Here’s all you need to do:

$ pip install construct

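Here’s how to extract the PNG dimensions from our data bytestring by using construct. The class names used here (Magic, UBInt32, String) come from the construct releases available when this was written; later releases renamed them, so check the documentation for the version you install: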

>>> from construct import Struct, Magic, UBInt32, Const, String

>>> # adapted from code at https://github.com/construct

>>> fmt = Struct('png',

...     Magic(b'\x89PNG\r\n\x1a\n'),

...     UBInt32('length'),

...     Const(String('type', 4), b'IHDR'),

...     UBInt32('width'),

...     UBInt32('height')

...     )

>>> data = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR' + \

... b'\x00\x00\x00\x9a\x00\x00\x00\x8d\x08\x02\x00\x00\x00\xc0'

>>> result = fmt.parse(data)

>>> print(result)

Container:

    length = 13

    type = b'IHDR'

    width = 154

    height = 141

>>> print(result.width, result.height)

154 141

Convert Bytes/Strings with binascii

The standard binascii module has functions with which you can convert between binary data and various string representations: hex (base 16), base 64, uuencoded, and others. For example, in the next snippet, let’s print that 8-byte PNG header as a sequence of hex values, instead of the mixture of ASCII and \xNN escapes that Python uses to display bytes variables:

>>> import binascii

>>> valid_png_header = b'\x89PNG\r\n\x1a\n'

>>> print(binascii.hexlify(valid_png_header))

b'89504e470d0a1a0a'

Hey, this thing works backwards, too:

>>> print(binascii.unhexlify(b'89504e470d0a1a0a'))

b'\x89PNG\r\n\x1a\n'
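Newer Python versions (3.5 and later) also provide the same conversion as methods on bytes itself. This is a small sketch of that alternative, not something the binascii example depends on:

```python
valid_png_header = b'\x89PNG\r\n\x1a\n'

# bytes.hex() returns a str, whereas binascii.hexlify() returns bytes.
hex_text = valid_png_header.hex()
print(hex_text)       # 89504e470d0a1a0a

# bytes.fromhex() converts a hex string back to bytes.
round_trip = bytes.fromhex(hex_text)
print(round_trip)     # b'\x89PNG\r\n\x1a\n'
```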

Bit Operators

Python provides bit-level integer operators, similar to those in the C language. Table 7-7 summarizes them and includes examples with the integers a (decimal 5, binary 0b0101) and b (decimal 1, binary 0b0001).

Table 7-7. Bit-level integer operators

Operator   Description    Example   Decimal result   Binary result
&          and            a & b     1                0b0001
|          or             a | b     5                0b0101
^          exclusive or   a ^ b     4                0b0100
~          flip bits      ~a        -6               binary representation depends on int size
<<         left shift     a << 1    10               0b1010
>>         right shift    a >> 1    2                0b0010

These operators work something like the set operators in Chapter 3. The & operator returns bits that are set in both arguments, and | returns bits that are set in either of them. The ^ operator returns bits that are set in one or the other, but not both. The ~ operator reverses all the bits in its single argument; this also reverses the sign because an integer’s highest bit indicates its sign (1 = negative) in two’s complement arithmetic, used in all modern computers. The << and >> operators just move bits to the left or right. A left shift of one bit is the same as multiplying by two, and a right shift of one bit is the same as dividing by two and discarding the remainder.
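The rows of Table 7-7 can be checked with a short script. This sketch uses the same a and b values; note that bin() drops leading zeros, and that Python integers have no fixed width, so the last line masks ~a to eight bits purely for display:

```python
a = 5   # 0b0101
b = 1   # 0b0001

print(a & b, bin(a & b))     # 1 0b1
print(a | b, bin(a | b))     # 5 0b101
print(a ^ b, bin(a ^ b))     # 4 0b100
print(~a, bin(~a))           # -6 -0b110
print(a << 1, bin(a << 1))   # 10 0b1010
print(a >> 1, bin(a >> 1))   # 2 0b10

# Mask ~a to its low eight bits to see the flipped bit pattern.
flipped = format(~a & 0xff, '08b')
print(flipped)               # 11111010
```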

Things to Do

7.1. Create a Unicode string called mystery and assign it the value '\U0001f4a9'. Print mystery. Look up the Unicode name for mystery.

7.2. Encode mystery, this time using UTF-8, into the bytes variable pop_bytes. Print pop_bytes.

7.3. Using UTF-8, decode pop_bytes into the string variable pop_string. Print pop_string. Is pop_string equal to mystery?

7.4. Write the following poem by using old-style formatting. Substitute the strings 'roast beef', 'ham', 'head', and 'clam' into this string:

My kitty cat likes %s,

My kitty cat likes %s,

My kitty cat fell on his %s

And now thinks he's a %s.

7.5. Write a form letter by using new-style formatting. Save the following string as letter (you’ll use it in the next exercise):

Dear {salutation} {name},

Thank you for your letter. We are sorry that our {product} {verbed} in your

{room}. Please note that it should never be used in a {room}, especially

near any {animals}.

Send us your receipt and {amount} for shipping and handling. We will send

you another {product} that, in our tests, is {percent}% less likely to

have {verbed}.

Thank you for your support.

Sincerely,

{spokesman}

{job_title}

7.6. Make a dictionary called response with values for the string keys 'salutation', 'name', 'product', 'verbed' (past tense verb), 'room', 'animals', 'amount', 'percent', 'spokesman', and 'job_title'. Print letter with the values from response.

7.7. When you’re working with text, regular expressions come in very handy. We’ll apply them in a number of ways to our featured text sample. It’s a poem titled “Ode on the Mammoth Cheese,” written by James McIntyre in 1866 in homage to a seven-thousand-pound cheese that was crafted in Ontario and sent on an international tour. If you’d rather not type all of it, use your favorite search engine and cut and paste the words into your Python program. Or, just grab it from Project Gutenberg. Call the text string mammoth.

We have seen thee, queen of cheese,

Lying quietly at your ease,

Gently fanned by evening breeze,

Thy fair form no flies dare seize.

All gaily dressed soon you'll go

To the great Provincial show,

To be admired by many a beau

In the city of Toronto.

Cows numerous as a swarm of bees,

Or as the leaves upon the trees,

It did require to make thee please,

And stand unrivalled, queen of cheese.

May you not receive a scar as

We have heard that Mr. Harris

Intends to send you off as far as

The great world's show at Paris.

Of the youth beware of these,

For some of them might rudely squeeze

And bite your cheek, then songs or glees

We could not sing, oh! queen of cheese.

Wert thou suspended from balloon,

You'd cast a shade even at noon,

Folks would think it was the moon

About to fall and crush them soon.

7.8. Import the re module to use Python’s regular expression functions. Use re.findall() to print all the words that begin with c.

7.9. Find all four-letter words that begin with c.

7.10. Find all the words that end with r.

7.11. Find all words that contain exactly three vowels in a row.

7.12. Use unhexlify to convert this hex string (combined from two strings to fit on a page) to a bytes variable called gif:

'47494638396101000100800000000000ffffff21f9' +

'0401000000002c000000000100010000020144003b'

7.13. The bytes in gif define a one-pixel transparent GIF file, one of the most common graphics file formats. A legal GIF starts with the string GIF89a. Does gif match this?

7.14. The pixel width of a GIF is a 16-bit little-endian integer beginning at byte offset 6, and the height is the same size, starting at offset 8. Extract and print these values for gif. Are they both 1?


[5] This wine has an umlaut in Germany, but loses it in France.

[6] Base 16, specified with characters 0-9 and A-F.