Unicode and Byte Strings - Advanced Topics - Learning Python (2013)

Part VIII. Advanced Topics

Chapter 37. Unicode and Byte Strings

So far, our exploration of strings in this book has been deliberately incomplete. Chapter 4’s types preview briefly introduced Python’s Unicode strings and files without giving many details, and the strings chapter in the core types part of this book (Chapter 7) deliberately limited its scope to the subset of string topics that most Python programmers need to know about.

This was by design: because many programmers, including most beginners, deal with simple forms of text like ASCII, they can happily work with Python’s basic str string type and its associated operations and don’t need to come to grips with more advanced string concepts. In fact, such programmers can often ignore the string changes in Python 3.X and continue to use strings as they may have in the past.

On the other hand, many other programmers deal with more specialized types of data: non-ASCII character sets, image file contents, and so on. For those programmers, and others who may someday join them, in this chapter we’re going to fill in the rest of the Python string story and look at some more advanced concepts in Python’s string model.

Specifically, we’ll explore the basics of Python’s support for Unicode text—rich character strings used in internationalized applications—as well as binary data—strings that represent absolute byte values. As we’ll see, the advanced string representation story has diverged in recent versions of Python:

§ Python 3.X provides an alternative string type for binary data, and supports Unicode text (including ASCII) in its normal string type.

§ Python 2.X provides an alternative string type for non-ASCII Unicode text, and supports both simple text and binary data in its normal string type.

In addition, because Python’s string model has a direct impact on how you process non-ASCII files, we’ll explore the fundamentals of that related topic here as well. Finally, we’ll take a brief look at some advanced string and binary tools, such as pattern matching, object pickling, binary data packing, and XML parsing, and the ways in which they are impacted by 3.X’s string changes.

This is officially an advanced topics chapter, because not all programmers will need to delve into the worlds of Unicode encodings or binary data. For some readers, Chapter 4’s preview may suffice, and others may wish to file this chapter away for future reference. If you ever need to care about processing either of these, though, you’ll find that Python’s string models provide the support you need.

String Changes in 3.X

One of the most noticeable changes in the Python 3.X line is the mutation of string object types. In a nutshell, 2.X’s str and unicode types have morphed into 3.X’s bytes and str types, and a new mutable bytearray type has been added. The bytearray type is technically available in Python 2.6 and 2.7 too (though not earlier), but it’s a back-port from 3.X and does not as clearly distinguish between text and binary content in 2.X.

Especially if you process data that is either Unicode or binary in nature, these changes can have substantial impacts on your code. As a general rule of thumb, how much you need to care about this topic depends in large part upon which of the following categories you fall into:

§ If you deal with non-ASCII Unicode text—for instance, in the context of internationalized domains like the Web, or the results of some XML and JSON parsers and databases—you will find support for text encodings to be different in 3.X, but also probably more direct, accessible, and seamless than in 2.X.

§ If you deal with binary data—for example, in the form of image or audio files or packed data processed with the struct module—you will need to understand 3.X’s new bytes object and 3.X’s different and sharper distinction between text and binary data and files.

§ If you fall into neither of the prior two categories, you can generally use strings in 3.X much as you would in 2.X, with the general str string type, text files, and all the familiar string operations we studied earlier. Your strings will be encoded and decoded by 3.X using your platform’s default encoding (e.g., ASCII, or UTF-8 on Windows in the U.S.—sys.getdefaultencoding gives your default if you care to check), but you probably won’t notice.

In other words, if your text is always ASCII, you can get by with normal string objects and text files and can avoid most of the following story for now. As we’ll see in a moment, ASCII is a simple kind of Unicode and a subset of other encodings, so string operations and files generally “just work” if your programs process only ASCII text.

Even if you fall into the last of the three categories just mentioned, though, a basic understanding of Unicode and 3.X’s string model can help both to demystify some of the underlying behavior now, and to make mastering Unicode or binary data issues easier if they impact you later.

To put that more strongly: like it or not, Unicode will be part of most software development in the interconnected future we’ve sown, and will probably impact you eventually. Though applications are beyond our scope here, if you work with the Internet, files, directories, network interfaces, databases, pipes, JSON, XML, and even GUIs, Unicode may no longer be an optional topic for you in Python 3.X.

Python 3.X’s support for Unicode and binary data is also available in 2.X, albeit in different forms. Although our main focus in this chapter is on string types in 3.X, we’ll also explore how 2.X’s equivalent support differs along the way for readers using 2.X. Regardless of which version you use, the tools we’ll explore here can become important in many types of programs.

String Basics

Before we look at any code, let’s begin with a general overview of Python’s string model. To understand why 3.X changed the way it did on this front, we have to start with a brief look at how characters are actually represented in computers—both when encoded in files and when stored in memory.

Character Encoding Schemes

Most programmers think of strings as series of characters used to represent textual data. While that’s accurate, the way characters are stored can vary, depending on what sort of character set must be recorded. When text is stored on files, for example, its character set determines its format.

Character sets are standards that assign integer codes to individual characters so they can be represented in computer memory. The ASCII standard, for example, was created in the U.S., and it defines many U.S. programmers’ notion of text strings. ASCII defines character codes from 0 through 127 and allows each character to be stored in one 8-bit byte, only 7 bits of which are actually used.

For example, the ASCII standard maps the character 'a' to the integer value 97 (0x61 in hex), which can be stored in a single byte in memory and files. If you wish to see how this works, Python’s ord built-in function gives the binary identifying value for a character, and chr returns the character for a given integer code value:

>>> ord('a') # 'a' is a byte with binary value 97 in ASCII (and others)

97

>>> hex(97)

'0x61'

>>> chr(97) # Binary value 97 stands for character 'a'

'a'

Sometimes one byte per character isn’t enough, though. Various symbols and accented characters, for instance, do not fit into the range of possible characters defined by ASCII. To accommodate special characters, some standards use all the possible values in an 8-bit byte, 0 through 255, to represent characters, and assign the values 128 through 255 (outside ASCII’s range) to special characters.

One such standard, known as the Latin-1 character set, is widely used in Western Europe. In Latin-1, character codes above 127 are assigned to accented and otherwise special characters. The character assigned to byte value 196, for example, is a specially marked non-ASCII character:

>>> 0xC4

196

>>> chr(196) # Python 3.X result form shown

'Ä'

This standard allows for a wide array of extra special characters, but still supports ASCII as a 7-bit subset of its 8-bit representation.
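
A quick sketch (runnable in Python 3.X) can confirm this subset relationship with the tools we just met:

```python
# Latin-1 keeps ASCII's 0-127 mapping and adds codes 128-255, one byte each
assert ord('a') == 97                                  # Same code in ASCII and Latin-1
assert chr(196) == 'Ä'                                 # Latin-1-only code, still one byte
assert 'a'.encode('latin-1') == 'a'.encode('ascii')    # Identical bytes for ASCII text
```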

Still, some alphabets define so many characters that it is impossible to represent each of them as one byte. Unicode allows more flexibility. Unicode text is sometimes referred to as “wide-character” strings, because characters may be represented with multiple bytes if needed. Unicode is typically used in internationalized programs, to represent European, Asian, and other non-English character sets that have more characters than 8-bit bytes can represent.

To store such rich text in computer memory, we say that characters are translated to and from raw bytes using an encoding—the rules for translating a string of Unicode characters to a sequence of bytes, and extracting a string from a sequence of bytes. More procedurally, this translation back and forth between bytes and strings is defined by two terms:

§ Encoding is the process of translating a string of characters into its raw bytes form, according to a desired encoding name.

§ Decoding is the process of translating a raw string of bytes into its character string form, according to its encoding name.

That is, we encode from string to raw bytes, and decode from raw bytes to string. To scripts, decoded strings are just characters in memory, but may be encoded into a variety of byte string representations when stored on files, transferred over networks, embedded in documents and databases, and so on.

For some encodings, the translation process is trivial—ASCII and Latin-1, for instance, map each character to a fixed-size single byte, so no translation work is required. For other encodings, the mapping can be more complex and yield multiple bytes per character, even for simple 8-bit forms of text.

The widely used UTF-8 encoding, for example, allows a wide range of characters to be represented by employing a variable number of bytes per character. Character codes less than 128 are represented as a single byte; codes between 128 and 0x7ff (2047) are turned into 2 bytes, where each byte has a value between 128 and 255; and codes above 0x7ff are turned into 3- or 4-byte sequences having values between 128 and 255. This keeps simple ASCII strings compact, sidesteps byte ordering issues, and avoids null (zero value) bytes that can cause problems for C libraries and networking.
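
To see the variable-length scheme at work, encode characters drawn from each code-point range and check the byte counts (a 3.X sketch; the sample characters are arbitrary):

```python
# UTF-8 byte counts grow with the character's code point value
assert len(chr(0x41).encode('utf-8')) == 1      # 'A': code < 128 -> 1 byte
assert len(chr(0xC4).encode('utf-8')) == 2      # 'Ä': 128..0x7FF -> 2 bytes
assert len(chr(0x20AC).encode('utf-8')) == 3    # '€': above 0x7FF -> 3 bytes
assert len(chr(0x1F600).encode('utf-8')) == 4   # Emoji: above 0xFFFF -> 4 bytes
```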

Because these encodings assign the same codes to the ASCII characters for compatibility, ASCII is a subset of both Latin-1 and UTF-8. That is, a valid ASCII character string is also a valid Latin-1- and UTF-8-encoded string. For example, every ASCII file is a valid UTF-8 file, because the ASCII character set is a 7-bit subset of UTF-8.

Conversely, the UTF-8 encoding is binary compatible with ASCII, but only for character codes less than 128. Latin-1 and UTF-8 simply allow for additional characters: Latin-1 for characters mapped to values 128 through 255 within a byte, and UTF-8 for characters that may be represented with multiple bytes.

Other encodings allow for richer character sets in different ways. UTF-16 and UTF-32, for example, format text with fixed-size schemes of 2 and 4 bytes per character, respectively, even for characters that could otherwise fit in a single byte. Some encodings may also insert prefixes that identify byte ordering.

To see this for yourself, run a string’s encode method, which gives its encoded byte-string format under a named scheme—a two-character ASCII string is 2 bytes in ASCII, Latin-1, and UTF-8, but it’s much wider in UTF-16 and UTF-32, and includes header bytes:

>>> S = 'ni'

>>> S.encode('ascii'), S.encode('latin1'), S.encode('utf8')

(b'ni', b'ni', b'ni')

>>> S.encode('utf16'), len(S.encode('utf16'))

(b'\xff\xfen\x00i\x00', 6)

>>> S.encode('utf32'), len(S.encode('utf32'))

(b'\xff\xfe\x00\x00n\x00\x00\x00i\x00\x00\x00', 12)

These results differ slightly in Python 2.X (you won’t get the leading b for byte strings). But all of these encoding schemes—ASCII, Latin-1, UTF-8, and many others—are considered to be Unicode.

To Python programmers, encodings are specified as strings containing the encoding’s name. Python comes with roughly 100 different encodings; see the Python library reference for a complete list. Importing the module encodings and running help(encodings) shows you many encoding names as well; some are implemented in Python, and some in C. Some encodings have multiple names, too; for example, latin-1, iso_8859_1, and 8859 are all synonyms for the same encoding, Latin-1. We’ll revisit encodings later in this chapter, when we study techniques for writing Unicode strings in a script.
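
One way to verify that such names are synonyms is the standard library's codecs module, whose lookup call resolves any alias to its canonical codec (a small 3.X sketch):

```python
import codecs

# Aliases resolve to one canonical codec entry
assert codecs.lookup('latin-1').name == codecs.lookup('iso_8859_1').name
print(codecs.lookup('latin-1').name)   # The codec's canonical name
```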

For more on the underlying Unicode story, see the Python standard manual set. It includes a “Unicode HOWTO” in its “Python HOWTOs” section, which provides additional background that we will skip here in the interest of space.

How Python Stores Strings in Memory

The prior section’s encodings really only apply when text is stored or transferred externally, in files and other mediums. In memory, Python always stores decoded text strings in an encoding-neutral format, which may or may not use multiple bytes for each character. All text processing occurs in this uniform internal format. Text is translated to and from an encoding-specific format only when it is transferred to or from external text files, byte strings, or APIs with specific encoding requirements. Once in memory, though, strings have no encoding. They are just the string object presented in this book.

Though irrelevant to your code, it may help some readers to make this more tangible. The way Python actually stores text in memory is prone to change over time, and in fact mutated substantially as of 3.3:

Python 3.2 and earlier

Through Python 3.2, strings are stored internally in fixed-length UTF-16 (roughly, UCS-2) format with 2 bytes per character, unless Python is configured to use 4 bytes per character (UCS-4).

Python 3.3 and later

Python 3.3 and later instead use a variable-length scheme with 1, 2, or 4 bytes per character, depending on a string’s content. The size is chosen based upon the character with the largest Unicode ordinal value in the represented string. This scheme allows a space-efficient representation in common cases, but also allows for full UCS-4 on all platforms.

Python 3.3’s new scheme is an optimization, especially compared to former wide Unicode builds. Per Python documentation: memory footprint is divided by 2 to 4 depending on the text; encoding an ASCII string to UTF-8 doesn’t need to encode characters anymore, because its ASCII and UTF-8 representations are the same; repeating a single ASCII letter and getting a substring of an ASCII string are 4 times faster; UTF-8 is 2 to 4 times faster; and UTF-16 encoding is up to 10 times faster. On some benchmarks, Python 3.3’s overall memory usage is 2 to 3 times smaller than 3.2, and similar to the less Unicode-centric 2.7.
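
You can observe the variable-length storage indirectly with sys.getsizeof, which reports an object's memory footprint. This is a CPython 3.3+ implementation detail, so the exact numbers will vary; only the ordering is meaningful:

```python
import sys

# Per-character storage grows with the widest character in the string
ascii_s  = 'a' * 100                 # ASCII: 1 byte per character
bmp_s    = '\u0411' * 100            # Cyrillic: 2 bytes per character
astral_s = '\U0001F600' * 100        # Emoji: 4 bytes per character
sizes = [sys.getsizeof(s) for s in (ascii_s, bmp_s, astral_s)]
assert sizes[0] < sizes[1] < sizes[2]
```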

Regardless of the storage scheme used, as noted in Chapter 6, Unicode clearly requires us to think of strings in terms of characters, instead of bytes. This may be a bigger hurdle for programmers accustomed to the simpler ASCII-only world where each character mapped to a single byte, but that idea no longer applies, in terms of both the results of text string tools and physical character size:

Text tools

Today, both string content and length really correspond to Unicode code points—identifying ordinal numbers for characters. For instance, the built-in ord function now returns a character’s Unicode code point ordinal, which is not necessarily an ASCII code, and which may or may not fit in a single 8-bit byte’s value. Similarly, len returns the number of characters, not bytes; the string is probably larger in memory, and its characters may not fit in bytes anyhow.
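
A short 3.X sketch makes the character/byte split concrete (the euro sign here is just a convenient non-ASCII sample):

```python
# ord gives the Unicode code point, and len counts characters, not bytes
s = 'spam\u20ac'                     # Four ASCII characters plus the euro sign
assert ord('\u20ac') == 8364         # Code point too large for a single byte
assert len(s) == 5                   # Five characters in the string...
assert len(s.encode('utf-8')) == 7   # ...but seven bytes once UTF-8 encoded
```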

Text size

As we saw by example in Chapter 4, under Unicode a single character does not necessarily map directly to a single byte, either when encoded in a file or when stored in memory. Even characters in simple 7-bit ASCII text may not map to bytes—UTF-16 uses multiple bytes per character in files, and Python may allocate 1, 2, or 4 bytes per character in memory. Thinking in terms of characters allows us to abstract away the details of external and internal storage.

The key point here, though, is that encoding pertains mostly to files and transfers. Once loaded into a Python string, text in memory has no notion of an “encoding,” and is simply a sequence of Unicode characters (a.k.a. code points) stored generically. In your script, that string is accessed as a Python string object—the next section’s topic.

Python’s String Types

At a more concrete level, the Python language provides string data types to represent character text in your scripts. The string types you will use in your scripts depend upon the version of Python you’re using. Python 2.X has a general string type for representing binary data and simple 8-bit text like ASCII, along with a specific type for representing richer Unicode text:

§ str for representing 8-bit text and binary data

§ unicode for representing decoded Unicode text

Python 2.X’s two string types are different (unicode allows for the extra size of some Unicode characters and has extra support for encoding and decoding), but their operation sets largely overlap. The str string type in 2.X is used for text that can be represented with 8-bit bytes (including ASCII and Latin-1), as well as binary data that represents absolute byte values.

By contrast, Python 3.X comes with three string object types—one for textual data and two for binary data:

§ str for representing decoded Unicode text (including ASCII)

§ bytes for representing binary data (including encoded text)

§ bytearray, a mutable flavor of the bytes type

As mentioned earlier, bytearray is also available in Python 2.6 and 2.7, but it’s simply a back-port from 3.X with less content-specific behavior and is generally considered a 3.X type.

Why the different string types?

All three string types in 3.X support similar operation sets, but they have different roles. The main goal behind this change in 3.X was to merge the normal and Unicode string types of 2.X into a single string type that supports both simple and Unicode text: developers wanted to remove the 2.X string dichotomy and make Unicode processing more natural. Given that ASCII and other 8-bit text is really a simple kind of Unicode, this convergence seems logically sound.

To achieve this, 3.X stores text in a redefined str type—an immutable sequence of characters (not necessarily bytes), which may contain either simple text such as ASCII whose character values fit in single bytes, or richer character set text such as UTF-8 whose character values may require multiple bytes. Strings processed by your script with this type are stored generically in memory, and are encoded to and decoded from byte strings per either the platform Unicode default or an explicit encoding name. This allows scripts to translate text to different encoding schemes, both in memory and when transferring to and from files.

While 3.X’s new str type does achieve the desired string/unicode merging, many programs still need to process raw binary data that is not encoded per any text format. Image and audio files, as well as packed data used to interface with devices or C programs you might process with Python’s struct module, fall into this category. Because Unicode strings are decoded from bytes, they cannot be used to represent bytes.

To support processing of such truly binary data, a new string type, bytes, also was introduced—an immutable sequence of 8-bit integers representing absolute byte values, which prints as ASCII characters when possible. Though a distinct object type, bytes supports almost all the same operations that the str type does; this includes string methods, sequence operations, and even re module pattern matching, but not string formatting. In 2.X, the general str type fills this binary data role, because its strings are just sequences of bytes; the separate unicode type handles richer text strings.

In more detail, a 3.X bytes object really is a sequence of small integers, each of which is in the range 0 through 255; indexing a bytes returns an int, slicing one returns another bytes, and running the list built-in on one returns a list of integers, not characters. When processed with operations that assume characters, though, the contents of bytes objects are assumed to be ASCII-encoded bytes (e.g., the isalpha method assumes each byte is an ASCII character code). Further, bytes objects are printed as character strings instead of integers for convenience.

While they were at it, Python developers also added a bytearray type in 3.X. bytearray is a variant of bytes that is mutable and so supports in-place changes. It supports the usual string operations that str and bytes do, as well as many of the same in-place change operations as lists (e.g., the append and extend methods, and assignment to indexes). This can be useful both for truly binary data and simple types of text. Assuming your text strings can be treated as raw 8-bit bytes (e.g., ASCII or Latin-1 text), bytearray finally adds direct in-place mutability for text data—something not possible without conversion to a mutable type in Python 2.X, and not supported by Python 3.X’s str or bytes.
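
For instance, a minimal 3.X sketch of the in-place changes that bytearray allows and its immutable relatives do not:

```python
# bytearray supports in-place changes that str and bytes objects reject
B = bytearray(b'spam')
B[0] = ord('x')                  # Assign an int (byte value) to an offset
B.extend(b'!!')                  # List-like growth methods also work
assert B == bytearray(b'xpam!!')
assert bytes(B) == b'xpam!!'     # Convert back to an immutable bytes
```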

Although Python 2.X and 3.X offer much the same functionality, they package it differently. In fact, the mapping from 2.X to 3.X string types is not completely direct—2.X’s str equates to both str and bytes in 3.X, and 3.X’s str equates to both str and unicode in 2.X. Moreover, the mutability of 3.X’s bytearray is unique.

In practice, though, this asymmetry is not as daunting as it might sound. It boils down to the following: in 2.X, you will use str for simple text and binary data and unicode for advanced forms of text whose character sets don’t map to 8-bit bytes; in 3.X, you’ll use str for any kind of text (ASCII, Latin-1, and all other kinds of Unicode) and bytes or bytearray for binary data. In practice, the choice is often made for you by the tools you use—especially in the case of file processing tools, the topic of the next section.

Text and Binary Files

File I/O (input and output) was also revamped in 3.X to reflect the str/bytes distinction and automatically support encoding Unicode text on transfers. Python now makes a sharp platform-independent distinction between text files and binary files; in 3.X:

Text files

When a file is opened in text mode, reading its data automatically decodes its content and returns it as a str; writing takes a str and automatically encodes it before transferring it to the file. Both reads and writes translate per a platform default or a provided encoding name. Text-mode files also support universal end-of-line translation and additional encoding specification arguments. Depending on the encoding name, text files may also automatically process the byte order mark sequence at the start of a file (more on this momentarily).

Binary files

When a file is opened in binary mode by adding a b (lowercase only) to the mode-string argument in the built-in open call, reading its data does not decode it in any way but simply returns its content raw and unchanged, as a bytes object; writing similarly takes a bytes object and transfers it to the file unchanged. Binary-mode files also accept a bytearray object for the content to be written to the file.

Because the language sharply differentiates between str and bytes, you must decide whether your data is text or binary in nature and use either str or bytes objects to represent its content in your script, as appropriate. Ultimately, the mode in which you open a file will dictate which type of object your script will use to represent its content:

§ If you are processing image files, data transferred over networks, packed binary data whose content you must extract, or some device data streams, chances are good that you will want to deal with it using bytes and binary-mode files. You might also opt for bytearray if you wish to update the data without making copies of it in memory.

§ If instead you are processing something that is textual in nature, such as program output, HTML, email content, or CSV or XML files, you’ll probably want to use str and text-mode files.

Notice that the mode string argument to built-in function open (its second argument) becomes fairly crucial in Python 3.X—its content not only specifies a file processing mode, but also implies a Python object type. By adding a b to the mode string, you specify binary mode and will receive, or must provide, a bytes object to represent the file’s content when reading or writing. Without the b, your file is processed in text mode, and you’ll use str objects to represent its content in your script. For example, the modes rb, wb, and rb+ imply bytes; r, w+, and rt (the default) imply str.
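
The mode/type pairing can be sketched with a throwaway file (the temporary path here is purely illustrative):

```python
import os, tempfile

# The open mode dictates the object type: text mode <-> str, binary mode <-> bytes
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w', encoding='utf-8') as f:    # Text mode: write takes a str
    f.write('spam\n')
with open(path, 'r', encoding='utf-8') as f:
    assert isinstance(f.read(), str)            # Text mode reads return str
with open(path, 'rb') as f:
    assert isinstance(f.read(), bytes)          # Binary mode reads return bytes
```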

Text-mode files also handle the byte order marker (BOM) sequence that may appear at the start of files under some encoding schemes. In the UTF-16 and UTF-32 encodings, for example, the BOM specifies big- or little-endian format (essentially, which end of a bit-string is most significant)—see the leading bytes in the results of the UTF-16 and UTF-32 encoding calls we ran earlier for examples. A UTF-8 text file might also include a BOM to declare that it is UTF-8 in general. When reading and writing data using these encoding schemes, Python skips or writes the BOM according to rules we’ll study later in this chapter.
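
As a preview of those rules, this 3.X sketch shows a BOM written and then transparently dropped by text mode (utf-8-sig behaves similarly for the optional UTF-8 BOM; the temporary path is illustrative):

```python
import os, tempfile

# BOM-aware encoding names write and strip the marker automatically
path = os.path.join(tempfile.mkdtemp(), 'bom.txt')
with open(path, 'w', encoding='utf-16') as f:    # utf-16 writes a BOM first
    f.write('spam')
with open(path, 'rb') as f:                      # Raw bytes expose the BOM
    raw = f.read()
assert raw[:2] in (b'\xff\xfe', b'\xfe\xff')     # Little- or big-endian marker
with open(path, 'r', encoding='utf-16') as f:    # Text mode drops it on reads
    assert f.read() == 'spam'
```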

In Python 2.X, the same behavior is supported, but normal files created by open are used to access bytes-based data, and Unicode files opened with the codecs.open call are used to process Unicode text data. The latter of these also encode and decode on transfer, as we’ll see later in this chapter. First, let’s explore Python’s Unicode string model live.

Coding Basic Strings

Let’s step through a few examples that demonstrate how the 3.X string types are used. One note up front: the code in this section was run with and applies to 3.X only. Still, basic string operations are generally portable across Python versions. Simple ASCII strings represented with the str type work the same in 2.X and 3.X (and exactly as we saw in Chapter 7 of this book).

Moreover, although there is no bytes type in Python 2.X (it has just the general str), it can usually run code that thinks there is—in 2.6 and 2.7, the call bytes(X) is present as a synonym for str(X), and the new literal form b'...' is taken to be the same as the normal string literal '...'. You may still run into version skew in some isolated cases, though; the 2.6/2.7 bytes call, for instance, does not require or allow the second argument (encoding name) that is required by 3.X’s bytes.

Python 3.X String Literals

Python 3.X string objects originate when you call a built-in function such as str or bytes, read a file created by calling open (described in the next section), or code literal syntax in your script. For the latter, a new literal form, b'xxx' (and equivalently, B'xxx') is used to create bytes objects in 3.X, and you may create bytearray objects by calling the bytearray function, with a variety of possible arguments.

More formally, in 3.X all the current string literal forms—'xxx', "xxx", and triple-quoted blocks—generate a str; adding a b or B just before any of them creates a bytes instead. This new b'...' bytes literal is similar in form to the r'...' raw string used to suppress backslash escapes. Consider the following, run in 3.X:

C:\code> C:\python33\python

>>> B = b'spam' # 3.X bytes literal makes a bytes object (8-bit bytes)

>>> S = 'eggs' # 3.X str literal makes a Unicode text string

>>> type(B), type(S)

(<class 'bytes'>, <class 'str'>)

>>> B # bytes: sequence of int, prints as character string

b'spam'

>>> S

'eggs'

The 3.X bytes object is actually a sequence of short integers, though it prints its content as characters whenever possible:

>>> B[0], S[0] # Indexing returns an int for bytes, str for str

(115, 'e')

>>> B[1:], S[1:] # Slicing makes another bytes or str object

(b'pam', 'ggs')

>>> list(B), list(S) # bytes is really 8-bit small ints

([115, 112, 97, 109], ['e', 'g', 'g', 's'])

The bytes object is also immutable, just like str (though bytearray, described later, is not); you cannot assign a str, bytes, or integer to an offset of a bytes object.

>>> B[0] = 'x' # Both are immutable

TypeError: 'bytes' object does not support item assignment

>>> S[0] = 'x'

TypeError: 'str' object does not support item assignment

Finally, note that the bytes literal’s b or B prefix also works for any string literal form, including triple-quoted blocks, though you get back a string of raw bytes that may or may not map to characters:

>>> # bytes prefix works on single, double, triple quotes, raw

>>> B = B"""

... xxxx

... yyyy

... """

>>> B

b'\nxxxx\nyyyy\n'

Python 2.X Unicode literals in Python 3.3

Python 2.X’s u'xxx' and U'xxx' Unicode string literal forms were removed in Python 3.0 because they were deemed redundant—normal strings are Unicode in 3.X. To aid both forward and backward compatibility, though, they are available again as of 3.3, where they are treated as normal str strings:

C:\code> C:\python33\python

>>> U = u'spam' # 2.X Unicode literal accepted in 3.3+

>>> type(U) # It is just str, but is backward compatible

<class 'str'>

>>> U

'spam'

>>> U[0]

's'

>>> list(U)

['s', 'p', 'a', 'm']

These literals are gone in 3.0 through 3.2, where you must use 'xxx' instead. You should generally use 3.X 'xxx' text literals in new 3.X-only code, because the 2.X form is superfluous. However, in 3.3 and later, using the 2.X literal form can ease the task of porting 2.X code, and boost 2.X code compatibility (for a case in point, see Chapter 25’s currency example, described in an upcoming note). Regardless of how text strings are coded in 3.X, though, they are all Unicode, even if they contain only ASCII characters (more on writing non-ASCII Unicode text in the section Coding Non-ASCII Text).

Python 2.X String Literals

All three of the 3.X string forms of the prior section can be coded in 2.X, but their meaning differs. As mentioned earlier, in Python 2.6 and 2.7 the b'xxx' bytes literal is present for forward compatibility with 3.X, but is the same as 'xxx' and makes a str (the b is ignored), and bytes is just a synonym for str; as you’ve seen, in 3.X both of these address the distinct bytes type:

C:\code> C:\python27\python

>>> B = b'spam' # 3.X bytes literal is just str in 2.6/2.7

>>> S = 'eggs' # str is a bytes/character sequence

>>> type(B), type(S)

(<type 'str'>, <type 'str'>)

>>> B, S

('spam', 'eggs')

>>> B[0], S[0]

('s', 'e')

>>> list(B), list(S)

(['s', 'p', 'a', 'm'], ['e', 'g', 'g', 's'])

In 2.X the special Unicode literal and type accommodates richer forms of text:

>>> U = u'spam' # 2.X Unicode literal makes a distinct type

>>> type(U) # Works in 3.3 too, but is just a str there

<type 'unicode'>

>>> U

u'spam'

>>> U[0]

u's'

>>> list(U)

[u's', u'p', u'a', u'm']

As we saw, for compatibility this form works in 3.3 and later too, but it simply makes a normal str there (the u is ignored).

String Type Conversions

Although Python 2.X allowed str and unicode type objects to be mixed in expressions (when the str contained only 7-bit ASCII text), 3.X draws a much sharper distinction—str and bytes type objects never mix automatically in expressions and never are converted to one another automatically when passed to functions. A function that expects an argument to be a str object won’t generally accept a bytes, and vice versa.

Because of this, Python 3.X basically requires that you commit to one type or the other, or perform manual, explicit conversions when needed:

§ str.encode() and bytes(S, encoding) translate a string to its raw bytes form and create an encoded bytes from a decoded str in the process.

§ bytes.decode() and str(B, encoding) translate raw bytes into their text string form and create a decoded str from an encoded bytes in the process.

These encode and decode methods (as well as file objects, described in the next section) use either a default encoding for your platform or an explicitly passed-in encoding name. For example, in Python 3.X:

>>> S = 'eggs'

>>> S.encode() # str->bytes: encode text into raw bytes

b'eggs'

>>> bytes(S, encoding='ascii') # str->bytes, alternative

b'eggs'

>>> B = b'spam'

>>> B.decode() # bytes->str: decode raw bytes into text

'spam'

>>> str(B, encoding='ascii') # bytes->str, alternative

'spam'

Two cautions here. First of all, your platform’s default encoding is available in the sys module, but the encoding argument to bytes is not optional, even though it is in str.encode (and bytes.decode).

Second, although calls to str do not require the encoding argument like bytes does, leaving it off in str calls does not mean that it defaults—instead, a str call without an encoding returns the bytes object’s print string, not its str converted form (this is usually not what you’ll want!). Assuming B and S are still as in the prior listing:

>>> import sys

>>> sys.platform # Underlying platform

'win32'

>>> sys.getdefaultencoding() # Default encoding for str here

'utf-8'

>>> bytes(S)

TypeError: string argument without an encoding

>>> str(B) # str without encoding

"b'spam'" # A print string, not conversion!

>>> len(str(B))

7

>>> len(str(B, encoding='ascii')) # Use encoding to convert to str

4

When in doubt, pass in an encoding name argument in 3.X, even if it may have a default. Conversions are similar in Python 2.X, though 2.X’s support for mixing string types in expressions makes conversions optional for ASCII text, and the tool names differ for the different string type model—conversions in 2.X occur between encoded str and decoded unicode, rather than 3.X’s encoded bytes and decoded str:

>>> S = 'spam' # 2.X type string conversion tools

>>> U = u'eggs'

>>> S, U

('spam', u'eggs')

>>> unicode(S), str(U) # 2.X converts str->uni, uni->str

(u'spam', 'eggs')

>>> S.decode(), U.encode() # versus 3.X byte->str, str->bytes

(u'spam', 'eggs')

Coding Unicode Strings

Encoding and decoding become more meaningful when you start dealing with non-ASCII Unicode text. To code arbitrary Unicode characters in your strings, some of which you might not even be able to type on your keyboard, Python string literals support both "\xNN" hex byte value escapes and "\uNNNN" and "\UNNNNNNNN" Unicode escapes. In Unicode escapes, the first form gives four hex digits to encode a 2-byte (16-bit) character code point, and the second gives eight hex digits for a 4-byte (32-bit) code point. Byte strings support only hex escapes for encoded text and other forms of byte-based data.
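To make the escape widths concrete, here is a small runnable sketch (Python 3.X; the specific characters are our own choices for illustration):

```python
# All three escape forms name the same Unicode code point in a 3.X str:
# \x takes exactly 2 hex digits, \u takes 4, and \U takes 8.
assert '\xe8' == '\u00e8' == '\U000000e8' == chr(0xE8)

# In a bytes literal, \xNN is the only one of these recognized;
# it denotes a raw byte value, not a decoded code point.
assert b'\xe8'[0] == 0xE8
```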

Coding ASCII Text

Let’s step through some examples that demonstrate text coding basics. As we’ve seen, ASCII text is a simple type of Unicode, stored as a sequence of byte values that represent characters:

C:\code> C:\python33\python

>>> ord('X') # 'X' is binary code point value 88 in the default encoding

88

>>> chr(88) # 88 stands for character 'X'

'X'

>>> S = 'XYZ' # A Unicode string of ASCII text

>>> S

'XYZ'

>>> len(S) # Three characters long

3

>>> [ord(c) for c in S] # Three characters with integer ordinal values

[88, 89, 90]

Normal 7-bit ASCII text like this is represented with one character per byte under each of the Unicode encoding schemes described earlier in this chapter:

>>> S.encode('ascii') # Values 0..127 in 1 byte (7 bits) each

b'XYZ'

>>> S.encode('latin-1') # Values 0..255 in 1 byte (8 bits) each

b'XYZ'

>>> S.encode('utf-8') # Values 0..127 in 1 byte, 128..2047 in 2, others 3 or 4

b'XYZ'

In fact, the bytes objects returned by encoding ASCII text this way are really a sequence of short integers, which just happen to print as ASCII characters when possible:

>>> S.encode('latin-1')

b'XYZ'

>>> S.encode('latin-1')[0]

88

>>> list(S.encode('latin-1'))

[88, 89, 90]

Coding Non-ASCII Text

Formally, to code non-ASCII characters, we can use:

§ Hex or Unicode escapes to embed Unicode code point ordinal values in text strings—normal string literals in 3.X, and Unicode string literals in 2.X (and in 3.3 for compatibility).

§ Hex escapes to embed the encoded representation of characters in byte strings—normal string literals in 2.X, and bytes string literals in 3.X (and in 2.X for compatibility).

Note that text strings embed actual code point values, while byte strings embed their encoded form. The value of a character’s encoded representation in a byte string is the same as its decoded Unicode code point value in a text string for only certain characters and encodings. In any event, hex escapes are limited to coding a single byte’s value, but Unicode escapes can name characters with values 2 and 4 bytes wide. The chr function can also be used to create a single non-ASCII character from its code point value, and as we’ll see later, source code declarations apply to such characters embedded in your script.

For instance, the hex values 0xC4 and 0xE8 are codes for two special accented characters outside the 7-bit range of ASCII, but we can embed them in 3.X str objects because str supports Unicode:

>>> chr(0xc4) # 0xC4, 0xE8: characters outside ASCII's range

'Ä'

>>> chr(0xe8)

'è'

>>> S = '\xc4\xe8' # Single 8-bit value hex escapes: two digits

>>> S

'Äè'

>>> S = '\u00c4\u00e8' # 16-bit Unicode escapes: four digits each

>>> S

'Äè'

>>> len(S) # Two characters long (not number of bytes!)

2

Note that in Unicode text string literals like these, hex and Unicode escapes denote a Unicode code point value, not byte values. The \x hex escapes require exactly two digits (for 8-bit code point values), and \u and \U Unicode escapes require exactly four and eight hexadecimal digits, respectively, for denoting code point values that can be as big as 16 and 32 bits will allow:

>>> S = '\U000000c4\U000000e8' # 32-bit Unicode escapes: eight digits each

>>> S

'Äè'

As shown later, Python 2.X works similarly in this regard, but Unicode escapes are allowed only in its Unicode literal form. They work in normal string literals in 3.X here simply because its normal strings are always Unicode.
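The difference in meaning between the two literal forms can be codified in a short 3.X sketch (the cp437 comparison is our own illustrative choice):

```python
# In a str literal, \xc4 is the code point U+00C4: one character.
text = '\xc4'
assert text == '\u00c4' and len(text) == 1

# In a bytes literal, \xc4 is one raw byte; what text it represents
# depends entirely on which encoding you decode it with.
raw = b'\xc4'
assert raw.decode('latin-1') == '\xc4'                # 'Ä' in latin-1
assert raw.decode('latin-1') != raw.decode('cp437')   # something else in cp437
```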

Encoding and Decoding Non-ASCII text

Now, if we try to encode the prior section’s non-ASCII text string into raw bytes using ASCII, we’ll get an error, because its characters are outside ASCII’s 7-bit code point value range:

>>> S = '\u00c4\u00e8' # Non-ASCII text string, two characters long

>>> S

'Äè'

>>> len(S)

2

>>> S.encode('ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1:

ordinal not in range(128)

Encoding this as Latin-1 works, though, because each character falls into that encoding’s 8-bit range, and we get 1 byte per character allocated in the encoded byte string. Encoding as UTF-8 also works: this encoding supports a wide range of Unicode code points, but allocates 2 bytes per non-ASCII character instead. If these encoded strings are written to a file, the raw bytes shown here for encoding results are what is actually stored on the file for the encoding types given:

>>> S.encode('latin-1') # 1 byte per character when encoded

b'\xc4\xe8'

>>> S.encode('utf-8') # 2 bytes per character when encoded

b'\xc3\x84\xc3\xa8'

>>> len(S.encode('latin-1')) # 2 bytes in latin-1, 4 in utf-8

2

>>> len(S.encode('utf-8'))

4

Note that you can also go the other way, reading raw bytes from a file and decoding them back to a Unicode string. However, as we’ll see later, the encoding mode you give to the open call causes this decoding to be done for you automatically on input (and avoids issues that may arise from reading partial character sequences when reading by blocks of bytes):

>>> B = b'\xc4\xe8' # Text encoded per Latin-1

>>> B

b'\xc4\xe8'

>>> len(B) # 2 raw bytes, two encoded characters

2

>>> B.decode('latin-1') # Decode to text per Latin-1

'Äè'

>>> B = b'\xc3\x84\xc3\xa8' # Text encoded per UTF-8

>>> len(B) # 4 raw bytes, two encoded characters

4

>>> B.decode('utf-8') # Decode to text per UTF-8

'Äè'

>>> len(B.decode('utf-8')) # Two Unicode characters in memory

2
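The sessions above suggest a general rule: decoding with the same scheme you encoded with always recovers the original text. A small sketch of that invariant (the encoding names here are illustrative):

```python
s = '\u00c4\u00e8'  # 'Äè': two non-ASCII characters

# Round trip: decode(encode(s)) returns the original text for any
# encoding able to represent the string's characters
for enc in ('latin-1', 'utf-8', 'utf-16', 'utf-32'):
    assert s.encode(enc).decode(enc) == s

# Byte lengths still differ per encoding, though the text is the same
assert len(s.encode('latin-1')) == 2
assert len(s.encode('utf-8')) == 4
```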

Other Encoding Schemes

Some encodings use even larger byte sequences to represent characters. When needed, you can specify both 16- and 32-bit Unicode code point values for characters in your strings—as shown earlier, we can use "\u..." with four hex digits for the former, and "\U..." with eight hex digits for the latter, and can mix these in literals with simpler ASCII characters freely:

>>> S = 'A\u00c4B\U000000e8C'

>>> S # A, B, C, and 2 non-ASCII characters

'AÄBèC'

>>> len(S) # Five characters long

5

>>> S.encode('latin-1')

b'A\xc4B\xe8C'

>>> len(S.encode('latin-1')) # 5 bytes when encoded per latin-1

5

>>> S.encode('utf-8')

b'A\xc3\x84B\xc3\xa8C'

>>> len(S.encode('utf-8')) # 7 bytes when encoded per utf-8

7

Technically speaking, you can also build Unicode strings piecemeal using chr instead of Unicode or hex escapes, but this might become tedious for large strings:

>>> S = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'

>>> S

'AÄBèC'

Some other encodings may use very different byte formats, though. The cp500 EBCDIC encoding, for example, doesn’t even encode ASCII the same way as the encodings we’ve been using so far; since Python encodes and decodes for us, we only generally need to care about this when providing encoding names for data sources:

>>> S

'AÄBèC'

>>> S.encode('cp500') # Two other Western European encodings

b'\xc1c\xc2T\xc3'

>>> S.encode('cp850') # 5 bytes each, different encoded values

b'A\x8eB\x8aC'

>>> S = 'spam' # ASCII text is the same in most

>>> S.encode('latin-1')

b'spam'

>>> S.encode('utf-8')

b'spam'

>>> S.encode('cp500') # But not in cp500: IBM EBCDIC!

b'\xa2\x97\x81\x94'

>>> S.encode('cp850')

b'spam'

The same holds true for the UTF-16 and UTF-32 encodings, which use fixed 2- and 4-byte-per-character schemes with same-sized headers—non-ASCII encodes differently, and ASCII is not 1 byte per character:

>>> S = 'A\u00c4B\U000000e8C'

>>> S.encode('utf-16')

b'\xff\xfeA\x00\xc4\x00B\x00\xe8\x00C\x00'

>>> S = 'spam'

>>> S.encode('utf-16')

b'\xff\xfes\x00p\x00a\x00m\x00'

>>> S.encode('utf-32')

b'\xff\xfe\x00\x00s\x00\x00\x00p\x00\x00\x00a\x00\x00\x00m\x00\x00\x00'

Byte String Literals: Encoded Text

Two cautions here too. First, Python 3.X allows special characters to be coded with both hex and Unicode escapes in str strings, but only with hex escapes in bytes strings—Unicode escape sequences are silently taken verbatim in bytes literals, not as escapes. In fact, bytes must be decoded to str strings to print their non-ASCII characters properly:

>>> S = 'A\xC4B\xE8C' # 3.X: str recognizes hex and Unicode escapes

>>> S

'AÄBèC'

>>> S = 'A\u00C4B\U000000E8C'

>>> S

'AÄBèC'

>>> B = b'A\xC4B\xE8C' # bytes recognizes hex but not Unicode

>>> B

b'A\xc4B\xe8C'

>>> B = b'A\u00C4B\U000000E8C' # Escape sequences taken literally!

>>> B

b'A\\u00C4B\\U000000E8C'

>>> B = b'A\xC4B\xE8C' # Use hex escapes for bytes

>>> B # Prints non-ASCII as hex

b'A\xc4B\xe8C'

>>> print(B)

b'A\xc4B\xe8C'

>>> B.decode('latin-1') # Decode as latin-1 to interpret as text

'AÄBèC'

Second, bytes literals require characters either to be ASCII characters or, if their values are greater than 127, to be escaped; str strings, on the other hand, allow literals containing any character in the source character set—which, as discussed later, defaults to UTF-8 unless an encoding declaration is given in the source file:

>>> S = 'AÄBèC' # Chars from UTF-8 if no encoding declaration

>>> S

'AÄBèC'

>>> B = b'AÄBèC'

SyntaxError: bytes can only contain ASCII literal characters.

>>> B = b'A\xC4B\xE8C' # Chars must be ASCII, or escapes

>>> B

b'A\xc4B\xe8C'

>>> B.decode('latin-1')

'AÄBèC'

>>> S.encode() # Source code encoded per UTF-8 by default

b'A\xc3\x84B\xc3\xa8C' # Uses system default to encode, unless passed

>>> S.encode('utf-8')

b'A\xc3\x84B\xc3\xa8C'

>>> B.decode() # Raw bytes do not correspond to utf-8

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: ...

Both these constraints make sense if you remember that byte strings hold bytes-based data, not decoded Unicode code point ordinals; while they may contain the encoded form of text, decoded code point values don’t quite apply to byte strings unless the characters are first encoded.
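One practical consequence: when you receive raw bytes of unknown origin, you can probe candidate encodings by attempting a decode. The helper below is a hypothetical sketch, not a standard tool:

```python
def decodes_as(raw, encoding):
    """Return True if the raw bytes are valid under the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

assert decodes_as(b'A\xc3\x84', 'utf-8')        # valid UTF-8 sequence
assert not decodes_as(b'A\xc4B\xe8C', 'utf-8')  # latin-1 bytes, not UTF-8
assert decodes_as(b'A\xc4B\xe8C', 'latin-1')    # latin-1 accepts any byte
```

Note that a successful decode proves only that the bytes are valid in that encoding, not that the encoding is the one actually intended.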

Converting Encodings

So far, we’ve been encoding and decoding strings to inspect their structure. It’s also possible to convert a string to a different encoding than its original, but we must provide an explicit encoding name to encode to and decode from. This is true whether the original text string originated in a file or a literal.

The term conversion may be a misnomer here—it really just means encoding a text string to raw bytes per a different encoding scheme than the one it was decoded from. As stressed earlier, decoded text in memory has no encoding type, and is simply a string of Unicode code points (a.k.a. characters); there is no concept of changing its encoding in this form. Still, this scheme allows scripts to read data in one encoding and store it in another, to support multiple clients of the same data:

>>> B = b'A\xc3\x84B\xc3\xa8C' # Text encoded in UTF-8 format originally

>>> S = B.decode('utf-8') # Decode to Unicode text per UTF-8

>>> S

'AÄBèC'

>>> T = S.encode('cp500') # Convert to encoded bytes per EBCDIC

>>> T

b'\xc1c\xc2T\xc3'

>>> U = T.decode('cp500') # Convert back to Unicode per EBCDIC

>>> U

'AÄBèC'

>>> U.encode() # Per default utf-8 encoding again

b'A\xc3\x84B\xc3\xa8C'

Keep in mind that the special Unicode and hex character escapes are only necessary when you code non-ASCII Unicode strings manually. In practice, you’ll often load such text from files instead. As we’ll see later in this chapter, 3.X’s file object (created with the open built-in function) automatically decodes text strings as they are read and encodes them when they are written; because of this, your script can often deal with strings generically, without having to code special characters directly.

Later in this chapter we’ll also see that it’s possible to convert between encodings when transferring strings to and from files, using a technique very similar to that in the last example; although you’ll still need to provide explicit encoding names when opening a file, the file interface does most of the conversion work for you automatically.
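The decode-then-encode pattern of this section can be wrapped in a tiny helper; transcode here is a hypothetical name, not part of Python’s API:

```python
def transcode(raw, from_enc, to_enc):
    """Re-encode bytes: decode per from_enc, then encode per to_enc."""
    return raw.decode(from_enc).encode(to_enc)

utf8 = b'A\xc3\x84B\xc3\xa8C'                # 'AÄBèC' encoded per UTF-8
ebcdic = transcode(utf8, 'utf-8', 'cp500')   # same text, EBCDIC bytes
assert ebcdic == b'\xc1c\xc2T\xc3'
assert transcode(ebcdic, 'cp500', 'utf-8') == utf8   # full round trip
```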

Coding Unicode Strings in Python 2.X

I stress Python 3.X Unicode support in this chapter because it’s new. But now that I’ve shown you the basics of Unicode strings in 3.X, I need to explain more fully how you can do much the same in 2.X, though the tools differ. unicode is available in Python 2.X, but is a distinct type from str, supports most of the same operations, and allows mixing of normal and Unicode strings when the str is all ASCII.

In fact, you can essentially pretend 2.X’s str is 3.X’s bytes when it comes to decoding raw bytes into a Unicode string, as long as it’s in the proper form. Here is 2.X in action; Unicode characters display in hex in 2.X unless you explicitly print, and non-ASCII displays can vary per shell (most of this section ran outside IDLE, which sometimes detects and prints Latin-1 characters in encoded byte strings—see ahead for more on PYTHONIOENCODING and Windows Command Prompt display issues):

C:\code> C:\python27\python

>>> S = 'A\xC4B\xE8C' # String of 8-bit bytes

>>> S # Text encoded per Latin-1, some non-ASCII

'A\xc4B\xe8C'

>>> print S # Nonprintable characters (IDLE may differ)

A─BΦC

>>> U = S.decode('latin1') # Decode bytes to Unicode text per latin-1

>>> U

u'A\xc4B\xe8C'

>>> print U

AÄBèC

>>> S.decode('utf-8') # Encoded form not compatible with utf-8

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 1: invalid c

ontinuation byte

>>> S.decode('ascii') # Encoded bytes are also outside ASCII range

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal

not in range(128)

To code Unicode text, make a unicode object with the u'xxx' literal form (as mentioned, this literal is available again in 3.3, but superfluous in 3.X in general, since its normal strings support Unicode):

>>> U = u'A\xC4B\xE8C' # Make Unicode string, hex escapes

>>> U

u'A\xc4B\xe8C'

>>> print U

AÄBèC

Once you’ve created it, you can convert Unicode text to different raw byte encodings, similar to encoding str objects into bytes objects in 3.X:

>>> U.encode('latin-1') # Encode per latin-1: 8-bit bytes

'A\xc4B\xe8C'

>>> U.encode('utf-8') # Encode per utf-8: multibyte

'A\xc3\x84B\xc3\xa8C'

Non-ASCII characters can be coded with hex or Unicode escapes in string literals in 2.X, just as in 3.X. However, as with bytes in 3.X, the "\u..." and "\U..." escapes are recognized only for unicode strings in 2.X, not 8-bit str strings—again, these are used to give the values of decoded Unicode ordinal integers, which don’t make sense in a raw byte string:

C:\code> C:\python27\python

>>> U = u'A\xC4B\xE8C' # Hex escapes for non-ASCII

>>> U

u'A\xc4B\xe8C'

>>> print U

AÄBèC

>>> U = u'A\u00C4B\U000000E8C' # Unicode escapes for non-ASCII

>>> U # u'' = 16 bits, U'' = 32 bits

u'A\xc4B\xe8C'

>>> print U

AÄBèC

>>> S = 'A\xC4B\xE8C' # Hex escapes work

>>> S

'A\xc4B\xe8C'

>>> print S # But some may print oddly, unless decoded

A─BΦC

>>> print S.decode('latin-1')

AÄBèC

>>> S = 'A\u00C4B\U000000E8C' # Not Unicode escapes: taken literally!

>>> S

'A\\u00C4B\\U000000E8C'

>>> print S

A\u00C4B\U000000E8C

>>> len(S)

19

Mixing string types in 2.X

Like 3.X’s str and bytes, 2.X’s unicode and str share nearly identical operation sets, so unless you need to convert to other encodings you can often treat unicode as though it were str. One of the primary differences between 2.X and 3.X, though, is that unicode and non-Unicode str objects can be freely mixed in 2.X expressions—as long as the str is compatible with the unicode object, Python will automatically convert it up to unicode:

>>> u'ab' + 'cd' # Can mix if compatible in 2.X

u'abcd' # But 'ab' + b'cd' not allowed in 3.X

However, this liberal approach to mixing string types in 2.X works only if the 8-bit string happens to contain only 7-bit (ASCII) bytes:

>>> S = 'A\xC4B\xE8C' # Can't mix in 2.X if str is non-ASCII!

>>> U = u'A\xC4B\xE8C'

>>> S + U

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal

not in range(128)

>>> 'abc' + U # Can mix only if str is all 7-bit ASCII

u'abcA\xc4B\xe8C'

>>> print 'abc' + U # Use print to display characters

abcAÄBèC

>>> S.decode('latin-1') + U # Manual conversion may be required in 2.X too

u'A\xc4B\xe8CA\xc4B\xe8C'

>>> print S.decode('latin-1') + U

AÄBèCAÄBèC

>>> print u'\xA3' + '999.99' # Also see Chapter 25's currency example

£999.99

By contrast, in 3.X, str and bytes never mix automatically and require manual conversions—the preceding code actually runs in 3.3, but only because 2.X’s Unicode literal is taken to be the same as a normal string by 3.X (the u is ignored); the 3.X equivalent would be a str added to a bytes (i.e., 'ab' + b'cd'), which fails in 3.X unless objects are converted to a common type.

In 2.X, though, the difference in types is often trivial to your code. Like normal strings, Unicode strings may be concatenated, indexed, sliced, matched with the re module, and so on, and they cannot be changed in place. If you ever need to convert between the two types explicitly, you can use the built-in str and unicode functions as shown earlier:

>>> str(u'spam') # Unicode to normal

'spam'

>>> unicode('spam') # Normal to Unicode

u'spam'

If you are using Python 2.X, also watch for the discussion of its different file interface later in this chapter. Its open call supports only files of 8-bit bytes, returning their contents as str strings, and it’s up to you to interpret the contents as text or binary data and decode if needed. To read and write Unicode files and encode or decode their content automatically, use the 2.X codecs.open call we’ll see in action later in this chapter. This call provides much the same functionality as 3.X’s open and uses 2.X unicode objects to represent file content—reading a file translates encoded bytes into decoded Unicode characters, and writing translates strings to the desired encoding specified when the file is opened.
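As a preview, codecs.open exists in 3.X as well, so a sketch of the pattern can be run under either line (the file path here is a temporary one of our own choosing):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'uni.txt')

# codecs.open encodes on write and decodes on read, in 2.X and 3.X alike
with codecs.open(path, 'w', encoding='latin-1') as f:
    f.write('A\xc4B\xe8C')                 # text out: encoded per latin-1

with codecs.open(path, 'r', encoding='latin-1') as f:
    assert f.read() == 'A\xc4B\xe8C'       # text back: decoded per latin-1

# The file itself holds the encoded bytes, one per character here
with open(path, 'rb') as f:
    assert f.read() == b'A\xc4B\xe8C'
```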

Source File Character Set Encoding Declarations

Finally, Unicode escape codes are fine for the occasional Unicode character in string literals, but they can become tedious if you need to embed non-ASCII text in your strings frequently. To interpret the content of strings you code and hence embed within the text of your script files, Python uses the UTF-8 encoding by default, but it allows you to change this to support arbitrary character sets by including a comment that names your desired encoding. The comment must be of this form and must appear as either the first or second line in your script in either Python 2.X or 3.X:

# -*- coding: latin-1 -*-

When a comment of this form is present, Python will recognize strings represented natively in the given encoding. This means you can edit your script file in a text editor that accepts and displays accented and other non-ASCII characters correctly, and Python will decode them correctly in your string literals. For example, notice how the comment at the top of the following file, text.py, allows Latin-1 characters to be embedded in strings, which are themselves embedded in the script file’s text:

# -*- coding: latin-1 -*-

# Any of the following string literal forms work in latin-1.

# Changing the encoding above to either ascii or utf-8 fails,

# because the 0xc4 and 0xe8 in myStr1 are not valid in either.

myStr1 = 'aÄBèC'

myStr2 = 'A\u00c4B\U000000e8C'

myStr3 = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'

import sys

print('Default encoding:', sys.getdefaultencoding())

for aStr in myStr1, myStr2, myStr3:

print('{0}, strlen={1}, '.format(aStr, len(aStr)), end='')

bytes1 = aStr.encode() # Per default utf-8: 2 bytes for non-ASCII

bytes2 = aStr.encode('latin-1') # One byte per char

#bytes3 = aStr.encode('ascii') # ASCII fails: outside 0..127 range

print('byteslen1={0}, byteslen2={1}'.format(len(bytes1), len(bytes2)))

When run, this script produces the following output, giving, for each of three coding techniques, the string, its length, and the lengths of its UTF-8 and Latin-1 encoded byte string forms:

C:\code> C:\python33\python text.py

Default encoding: utf-8

aÄBèC, strlen=5, byteslen1=7, byteslen2=5

AÄBèC, strlen=5, byteslen1=7, byteslen2=5

AÄBèC, strlen=5, byteslen1=7, byteslen2=5

Since many programmers are likely to fall back on the standard UTF-8 encoding, I’ll defer to Python’s standard manual set for more details on this option and other advanced Unicode support topics, such as properties and character name escapes in strings I’m omitting here. For this chapter, let’s take a quick look at the new byte string object types in Python 3.X, before moving on to its file and tool changes.

NOTE

For an additional example of non-ASCII character coding and source file declarations, see the currency symbols used in the money formatting example of Chapter 25, as well as its associated file in this book’s examples package, formats_currency2.py. The latter requires a source-file declaration to be usable by Python, because it embeds non-ASCII currency symbol characters. This example also illustrates the portability gains possible when using 2.X’s Unicode literal in 3.X code in 3.3 and later.

Using 3.X bytes Objects

We studied a wide variety of operations available for Python 3.X’s general str string type in Chapter 7; the basic string type works identically in 2.X and 3.X, so we won’t rehash this topic. Instead, let’s dig a bit deeper into the operation sets provided by the new bytes type in 3.X.

As mentioned previously, the 3.X bytes object is a sequence of small integers, each of which is in the range 0 through 255, that happens to print as ASCII characters when displayed. It supports sequence operations and most of the same methods available on str objects (and present in 2.X’s str type). However, bytes does not support the format method or the % formatting expression, and you cannot mix and match bytes and str type objects without explicit conversions—you generally will use all str type objects and text files for text data, and all bytes type objects and binary files for binary data.

Method Calls

If you really want to see what attributes str has that bytes doesn’t, you can always check their dir built-in function results. The output can also tell you something about the expression operators they support (e.g., __mod__ and __rmod__ implement the % operator):

C:\code> C:\python33\python

# Attributes in str but not bytes

>>> set(dir('abc')) - set(dir(b'abc'))

{'isdecimal', '__mod__', '__rmod__', 'format_map', 'isprintable',

'casefold', 'format', 'isnumeric', 'isidentifier', 'encode'}

# Attributes in bytes but not str

>>> set(dir(b'abc')) - set(dir('abc'))

{'decode', 'fromhex'}

As you can see, str and bytes have almost identical functionality. Their unique attributes are generally methods that don’t apply to the other; for instance, decode translates a raw bytes into its str representation, and encode translates a string into its raw bytes representation. Most of the methods are the same, though bytes methods require bytes arguments (again, 3.X string types don’t mix). Also recall that bytes objects are immutable, just like str objects in both 2.X and 3.X (error messages here have been shortened for brevity):

>>> B = b'spam' # b'...' bytes literal

>>> B.find(b'pa')

1

>>> B.replace(b'pa', b'XY') # bytes methods expect bytes arguments

b'sXYm'

>>> B.split(b'pa') # bytes methods return bytes results

[b's', b'm']

>>> B

b'spam'

>>> B[0] = 'x'

TypeError: 'bytes' object does not support item assignment

One notable difference is that string formatting works only on str objects in 3.X, not on bytes objects (see Chapter 7 for more on string formatting expressions and methods):

>>> '%s' % 99

'99'

>>> b'%s' % 99

TypeError: unsupported operand type(s) for %: 'bytes' and 'int'

>>> '{0}'.format(99)

'99'

>>> b'{0}'.format(99)

AttributeError: 'bytes' object has no attribute 'format'

Sequence Operations

Besides method calls, all the usual generic sequence operations you know (and possibly love) from Python 2.X strings and lists work as expected on both str and bytes in 3.X; this includes indexing, slicing, concatenation, and so on. Notice in the following that indexing a bytes object returns an integer giving the byte’s binary value; bytes really is a sequence of 8-bit integers, but for convenience prints as a string of ASCII-coded characters where possible when displayed as a whole. To check a given byte’s value, use the chr built-in to convert it back to its character, as in the following:

>>> B = b'spam' # A sequence of small ints

>>> B # Prints as ASCII characters (and/or hex escapes)

b'spam'

>>> B[0] # Indexing yields an int

115

>>> B[-1]

109

>>> chr(B[0]) # Show character for int

's'

>>> list(B) # Show all the byte's int values

[115, 112, 97, 109]

>>> B[1:], B[:-1]

(b'pam', b'spa')

>>> len(B)

4

>>> B + b'lmn'

b'spamlmn'

>>> B * 4

b'spamspamspamspam'

Other Ways to Make bytes Objects

So far, we’ve been mostly making bytes objects with the b'...' literal syntax. We can also create them by calling the bytes constructor with a str and an encoding name, calling the bytes constructor with an iterable of integers representing byte values, or encoding a str object per the default (or passed-in) encoding. As we’ve seen, encoding takes a text str and returns the raw encoded byte values of the string per the encoding specified; conversely, decoding takes a raw bytes sequence and translates it to its str text string representation—a series of Unicode characters. Both operations create new string objects:

>>> B = b'abc' # Literal

>>> B

b'abc'

>>> B = bytes('abc', 'ascii') # Constructor with encoding name

>>> B

b'abc'

>>> ord('a')

97

>>> B = bytes([97, 98, 99]) # Integer iterable

>>> B

b'abc'

>>> B = 'spam'.encode() # str.encode() (or bytes())

>>> B

b'spam'

>>>

>>> S = B.decode() # bytes.decode() (or str())

>>> S

'spam'

From a functional perspective, the last two of these operations are really tools for converting between str and bytes, a topic introduced earlier and expanded upon in the next section.
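Since all of these routes build equivalent objects, a quick sketch can confirm the equivalences (Python 3.X):

```python
# Four ways to build the same bytes object
ways = [
    b'abc',                   # literal
    bytes('abc', 'ascii'),    # constructor with encoding name
    bytes([97, 98, 99]),      # iterable of integer byte values
    'abc'.encode('ascii'),    # encoding a str
]
assert all(w == b'abc' for w in ways)

# And decoding goes back the other way, to a text str
assert b'abc'.decode('ascii') == 'abc'
```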

Mixing String Types

In the replace call of the section Method Calls, we had to pass in two bytes objects—str types won’t work there. Although Python 2.X automatically converts str to and from unicode when possible (i.e., when the str is 7-bit ASCII text), Python 3.X requires specific string types in some contexts and expects manual conversions if needed:

# Must pass expected types to function and method calls

>>> B = b'spam'

>>> B.replace('pa', 'XY')

TypeError: expected an object with the buffer interface

>>> B.replace(b'pa', b'XY')

b'sXYm'

>>> B = B'spam'

>>> B.replace(bytes('pa'), bytes('xy'))

TypeError: string argument without an encoding

>>> B.replace(bytes('pa', 'ascii'), bytes('xy', 'utf-8'))

b'sxym'

# Must convert manually in 3.X mixed-type expressions

>>> b'ab' + 'cd'

TypeError: can't concat bytes to str

>>> b'ab'.decode() + 'cd' # bytes to str

'abcd'

>>> b'ab' + 'cd'.encode() # str to bytes

b'abcd'

>>> b'ab' + bytes('cd', 'ascii') # str to bytes

b'abcd'

Although you can create bytes objects yourself to represent packed binary data, they can also be made automatically by reading files opened in binary mode, as we’ll see in more detail later in this chapter. First, though, let’s introduce bytes’s very close, and mutable, cousin.
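When a function must accept either type in 3.X, one common approach is to normalize at the boundary; as_text below is a hypothetical helper, not a standard function:

```python
def as_text(obj, encoding='utf-8'):
    """Return obj as a str, decoding it first if it arrived as bytes."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    return obj

assert as_text(b'spam') == 'spam'              # bytes decoded to str
assert as_text('spam') == 'spam'               # str passed through as is
assert as_text(b'A\xc3\x84', 'utf-8') == 'A\xc4'
```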

Using 3.X/2.6+ bytearray Objects

So far we’ve focused on str and bytes, because they subsume Python 2’s unicode and str. Python 3.X grew a third string type, though—bytearray, a mutable sequence of integers in the range 0 through 255, which is a mutable variant of bytes. As such, it supports the same string methods and sequence operations as bytes, as well as many of the mutable in-place-change operations supported by lists.

Bytearrays support in-place changes to both truly binary data as well as simple forms of text such as ASCII, which can be represented with 1 byte per character (richer Unicode text generally requires Unicode strings, which are still immutable). The bytearray type is also available in Python 2.6 and 2.7 as a back-port from 3.X, but it does not enforce the strict text/binary distinction there that it does in 3.X.

bytearrays in Action

Let’s take a quick tour. We can create bytearray objects by calling the bytearray built-in. In Python 2.X, any string may be used to initialize:

# Creation in 2.6/2.7: a mutable sequence of small (0..255) ints

>>> S = 'spam'

>>> C = bytearray(S) # A back-port from 3.X in 2.6+

>>> C # b'..' == '..' in 2.6+ (str)

bytearray(b'spam')

In Python 3.X, an encoding name or byte string is required, because text and binary strings do not mix (though byte strings may reflect encoded Unicode text):

# Creation in 3.X: text/binary do not mix

>>> S = 'spam'

>>> C = bytearray(S)

TypeError: string argument without an encoding

>>> C = bytearray(S, 'latin1') # A content-specific type in 3.X

>>> C

bytearray(b'spam')

>>> B = b'spam' # b'..' != '..' in 3.X (bytes/str)

>>> C = bytearray(B)

>>> C

bytearray(b'spam')

Once created, bytearray objects are sequences of small integers like bytes and are mutable like lists, though they require an integer for index assignments, not a string (all of the following is a continuation of this session and is run under Python 3.X unless otherwise noted—see comments for 2.X usage notes):

# Mutable, but must assign ints, not strings

>>> C[0]

115

>>> C[0] = 'x' # This and the next work in 2.6/2.7

TypeError: an integer is required

>>> C[0] = b'x'

TypeError: an integer is required

>>> C[0] = ord('x') # Use ord() to get a character's ordinal

>>> C

bytearray(b'xpam')

>>> C[1] = b'Y'[0] # Or index a byte string

>>> C

bytearray(b'xYam')

Processing bytearray objects borrows from both strings and lists, since they are mutable byte strings. While the bytearray’s methods overlap with both str and bytes, it also has many of the list’s mutable methods. Besides named methods, the __iadd__ and __setitem__ methods in bytearray implement += in-place concatenation and index assignment, respectively:

# in bytes but not bytearray

>>> set(dir(b'abc')) - set(dir(bytearray(b'abc')))

{'__getnewargs__'}

# in bytearray but not bytes

>>> set(dir(bytearray(b'abc'))) - set(dir(b'abc'))

{'__iadd__', 'reverse', '__setitem__', 'extend', 'copy', '__alloc__',

'__delitem__', '__imul__', 'remove', 'clear', 'insert', 'append', 'pop'}

You can change a bytearray in place with both index assignment, as you’ve just seen, and list-like methods like those shown here (to change text in place prior to 2.6, you would need to convert to and then from a list, with list(str) and ''.join(list)—see Chapter 4 and Chapter 6 for examples):

# Mutable method calls

>>> C

bytearray(b'xYam')

>>> C.append(b'LMN') # 2.X requires string of size 1

TypeError: an integer is required

>>> C.append(ord('L'))

>>> C

bytearray(b'xYamL')

>>> C.extend(b'MNO')

>>> C

bytearray(b'xYamLMNO')
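Slice assignment rounds out the mutable interface—a bytearray slice can be overwritten or even resized in place, which is handy for patching fixed-format buffers. The following is a small sketch whose buffer and field names are purely illustrative, run under 3.X:

```python
# Patch a record buffer in place: slice assignment overwrites (or resizes)
# a region without building a new object, and extend grows the tail.
buf = bytearray(b'HDR:0000:payload')
buf[4:8] = b'0042'                 # Overwrite the 4-byte counter field
buf.extend(b':END')                # List-like growth at the end
assert bytes(buf) == b'HDR:0042:payload:END'
```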

All the usual sequence operations and string methods work on bytearrays, as you would expect (notice that like bytes objects, their expressions and methods expect bytes arguments, not str arguments):

# Sequence operations and string methods

>>> C

bytearray(b'xYamLMNO')

>>> C + b'!#'

bytearray(b'xYamLMNO!#')

>>> C[0]

120

>>> C[1:]

bytearray(b'YamLMNO')

>>> len(C)

8

>>> C.replace('xY', 'sp') # This works in 2.X

TypeError: Type str doesn't support the buffer API

>>> C.replace(b'xY', b'sp')

bytearray(b'spamLMNO')

>>> C

bytearray(b'xYamLMNO')

>>> C * 4

bytearray(b'xYamLMNOxYamLMNOxYamLMNOxYamLMNO')

Python 3.X String Types Summary

Finally, by way of summary, the following examples demonstrate how bytes and bytearray objects are sequences of ints, and str objects are sequences of characters:

# Binary versus text

>>> B # B is same as S in 2.6/2.7

b'spam'

>>> list(B)

[115, 112, 97, 109]

>>> C

bytearray(b'xYamLMNO')

>>> list(C)

[120, 89, 97, 109, 76, 77, 78, 79]

>>> S

'spam'

>>> list(S)

['s', 'p', 'a', 'm']

Although all three Python 3.X string types can contain character values and support many of the same operations, again, you should always:

§ Use str for textual data.

§ Use bytes for binary data.

§ Use bytearray for binary data you wish to change in place.

Related tools such as files, the next section’s topic, often make the choice for you.

Using Text and Binary Files

This section expands on the impact of Python 3.X’s string model on the file processing basics introduced earlier in the book. As mentioned earlier, the mode in which you open a file is crucial—it determines which object type you will use to represent the file’s content in your script. Text mode implies str objects, and binary mode implies bytes objects:

§ Text-mode files interpret file contents according to a Unicode encoding—either the default for your platform, or one whose name you pass in. By passing in an encoding name to open, you can force conversions for various types of Unicode files. Text-mode files also perform universal line-end translations: by default, all line-end forms map to the single '\n' character in your script, regardless of the platform on which you run it. As described earlier, text files also handle reading and writing the byte order mark (BOM) stored at the start of a file in some Unicode encoding schemes.

§ Binary-mode files instead return file content to you raw, as a sequence of integers representing byte values, with no encoding or decoding and no line-end translations.

The second argument to open determines whether you want text or binary processing, just as it does in 2.X Python—adding a b to this string implies binary mode (e.g., "rb" to read binary data files). The default mode is "rt"; this is the same as "r", which means text input (just as in 2.X).

In 3.X, though, this mode argument to open also implies an object type for file content representation, regardless of the underlying platform—text files return a str for reads and expect one for writes, but binary files return a bytes for reads and expect one (or a bytearray) for writes.

Text File Basics

To demonstrate, let’s begin with basic file I/O. As long as you’re processing basic text files (e.g., ASCII) and don’t care about circumventing the platform-default encoding of strings, files in 3.X look and feel much as they do in 2.X (for that matter, so do strings in general). The following, for instance, writes one line of text to a file and reads it back in 3.X, exactly as it would in 2.X (note that file is no longer a built-in name in 3.X, so it’s perfectly OK to use it as a variable here):

C:\code> C:\python33\python

# Basic text files (and strings) work the same as in 2.X

>>> file = open('temp', 'w')

>>> size = file.write('abc\n') # Returns number of characters written

>>> file.close() # Manual close to flush output buffer

>>> file = open('temp') # Default mode is "r" (== "rt"): text input

>>> text = file.read()

>>> text

'abc\n'

>>> print(text)

abc

Text and Binary Modes in 2.X and 3.X

In Python 2.X, there is no major distinction between text and binary files—both accept and return content as str strings. The only major difference is that text files automatically map \n end-of-line characters to and from \r\n on Windows, while binary files do not (I’m stringing operations together into one-liners here just for brevity):

C:\code> C:\python27\python

>>> open('temp', 'w').write('abd\n') # Write in text mode: adds \r

>>> open('temp', 'r').read() # Read in text mode: drops \r

'abd\n'

>>> open('temp', 'rb').read() # Read in binary mode: verbatim

'abd\r\n'

>>> open('temp', 'wb').write('abc\n') # Write in binary mode

>>> open('temp', 'r').read() # \n not expanded to \r\n

'abc\n'

>>> open('temp', 'rb').read()

'abc\n'

In Python 3.X, things are a bit more complex because of the distinction between str for text data and bytes for binary data. To demonstrate, let’s write a text file and read it back in both modes in 3.X. Notice that we are required to provide a str for writing, but reading gives us a str or a bytes, depending on the open mode:

C:\code> C:\python33\python

# Write and read a text file

>>> open('temp', 'w').write('abc\n') # Text mode output, provide a str

4

>>> open('temp', 'r').read() # Text mode input, returns a str

'abc\n'

>>> open('temp', 'rb').read() # Binary mode input, returns a bytes

b'abc\r\n'

Notice how on Windows text-mode files translate the \n end-of-line character to \r\n on output; on input, text mode translates the \r\n back to \n, but binary-mode files do not. This is the same in 2.X, and it’s normally what we want—for portability, text files should map end-of-line markers to and from \n (which is what is actually present in files in Linux, where no mapping occurs), and such translations should never occur for binary data (where end-of-line bytes are irrelevant). Although you can control this behavior with extra open arguments in 3.X if desired, the default usually works well.

Now let’s do the same again, but with a binary file. We provide a bytes to write in this case, and we still get back a str or a bytes, depending on the input mode:

# Write and read a binary file

>>> open('temp', 'wb').write(b'abc\n') # Binary mode output, provide a bytes

4

>>> open('temp', 'r').read() # Text mode input, returns a str

'abc\n'

>>> open('temp', 'rb').read() # Binary mode input, returns a bytes

b'abc\n'

Note that the \n end-of-line character is not expanded to \r\n in binary-mode output—again, a desired result for binary data. Type requirements and file behavior are the same even if the data we’re writing to the binary file is truly binary in nature. In the following, for example, the "\x00"is a binary zero byte and not a printable character:

# Write and read truly binary data

>>> open('temp', 'wb').write(b'a\x00c') # Provide a bytes

3

>>> open('temp', 'r').read() # Receive a str

'a\x00c'

>>> open('temp', 'rb').read() # Receive a bytes

b'a\x00c'

Binary-mode files always return contents as a bytes object, but accept either a bytes or bytearray object for writing; this naturally follows, given that bytearray is basically just a mutable variant of bytes. In fact, most APIs in Python 3.X that accept a bytes also allow a bytearray:

# bytearrays work too

>>> BA = bytearray(b'\x01\x02\x03')

>>> open('temp', 'wb').write(BA)

3

>>> open('temp', 'r').read()

'\x01\x02\x03'

>>> open('temp', 'rb').read()

b'\x01\x02\x03'
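This pairing suggests a common idiom—load a binary file’s content into a bytearray, patch it in place, and write the result back. A minimal sketch, using a made-up filename:

```python
# Read a binary file into a mutable buffer, change one byte, write it back.
# 'patch.bin' is a hypothetical filename used for illustration only.
with open('patch.bin', 'wb') as f:
    f.write(b'\x00\x01\x02\x03')

data = bytearray(open('patch.bin', 'rb').read())   # bytes -> mutable copy
data[1] = 0xFF                                     # In-place byte change
with open('patch.bin', 'wb') as f:
    f.write(data)                                  # write accepts bytearray too

assert open('patch.bin', 'rb').read() == b'\x00\xff\x02\x03'
```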

Type and Content Mismatches in 3.X

Notice that you cannot get away with violating Python’s str/bytes type distinction when it comes to files. As the following examples illustrate, we get errors (shortened here) if we try to write a bytes to a text file or a str to a binary file (the exact text of the error messages here is prone to change):

# Types are not flexible for file content

>>> open('temp', 'w').write('abc\n') # Text mode makes and requires str

4

>>> open('temp', 'w').write(b'abc\n')

TypeError: must be str, not bytes

>>> open('temp', 'wb').write(b'abc\n') # Binary mode makes and requires bytes

4

>>> open('temp', 'wb').write('abc\n')

TypeError: 'str' does not support the buffer interface

This makes sense: text has no meaning in binary terms, before it is encoded. Although it is often possible to convert between the types by encoding str and decoding bytes, as described earlier in this chapter, you will usually want to stick to either str for text data or bytes for binary data. Because the str and bytes operation sets largely intersect, the choice won’t be much of a dilemma for most programs (see the string tools coverage in the final section of this chapter for some prime examples of this).

In addition to type constraints, file content can matter in 3.X. Text-mode output files require a str instead of a bytes for content, so there is no way in 3.X to write truly binary data to a text-mode file. Depending on the encoding rules, bytes outside the default character set can sometimes be embedded in a normal string, and they can always be written in binary mode (some of the following raise errors when displaying their string results in Pythons prior to 3.3, but the file operations work successfully):

# Can't read truly binary data in text mode

>>> chr(0xFF) # FF is a valid char, FE is not

'ÿ'

>>> chr(0xFE) # An error in some Pythons

'\xfe'

>>> open('temp', 'w').write(b'\xFF\xFE\xFD') # Can't use arbitrary bytes!

TypeError: must be str, not bytes

>>> open('temp', 'w').write('\xFF\xFE\xFD') # Can write if embeddable in str

3

>>> open('temp', 'wb').write(b'\xFF\xFE\xFD') # Can also write in binary mode

3

>>> open('temp', 'rb').read() # Can always read as binary bytes

b'\xff\xfe\xfd'

>>> open('temp', 'r').read() # Can't read text unless decodable!

'ÿ\xfe\xfd' # An error in some Pythons

In general, however, because text-mode input files in 3.X must be able to decode content per a Unicode encoding, there is no way to read truly binary data in text mode, as the next section explains.

Using Unicode Files

So far, we’ve been reading and writing basic text and binary files. It turns out to be easy to read and write Unicode text stored in files too, because the 3.X open call accepts an encoding for text files, and arranges to run the required encoding and decoding for us automatically as data is transferred. This allows us to process a variety of Unicode text created with different encodings than the default for the platform, and store the same text in different encodings for different purposes.

Reading and Writing Unicode in 3.X

In fact, we can effectively convert a string to different encoded forms both manually with method calls as we did earlier, and automatically on file input and output. We’ll use the following Unicode string in this section to demonstrate:

C:\code> C:\python33\python

>>> S = 'A\xc4B\xe8C' # Five-character decoded string, non-ASCII

>>> S

'AÄBèC'

>>> len(S)

5

Manual encoding

As we’ve already learned, we can always encode such a string to raw bytes according to the target encoding name:

# Encode manually with methods

>>> L = S.encode('latin-1') # 5 bytes when encoded as latin-1

>>> L

b'A\xc4B\xe8C'

>>> len(L)

5

>>> U = S.encode('utf-8') # 7 bytes when encoded as utf-8

>>> U

b'A\xc3\x84B\xc3\xa8C'

>>> len(U)

7
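Because encoding and decoding are inverse operations, a round trip through any single scheme recovers the original string, even though the intermediate byte counts differ per encoding—a quick sketch:

```python
# Round trip: str -> bytes -> str is lossless per encoding scheme, but
# the encoded byte lengths vary with the scheme chosen.
S = 'A\xc4B\xe8C'
for name in ('latin-1', 'utf-8', 'utf-16'):
    assert S.encode(name).decode(name) == S
assert len(S.encode('latin-1')) == 5   # 1 byte per character
assert len(S.encode('utf-8')) == 7     # 2 bytes for each non-ASCII character
```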

File output encoding

Now, to write our string to a text file in a particular encoding, we can simply pass the desired encoding name to open—although we could manually encode first and write in binary mode, there’s no need to:

# Encoding automatically when written

>>> open('latindata', 'w', encoding='latin-1').write(S) # Write as latin-1

5

>>> open('utf8data', 'w', encoding='utf-8').write(S) # Write as utf-8

5

>>> open('latindata', 'rb').read() # Read raw bytes

b'A\xc4B\xe8C'

>>> open('utf8data', 'rb').read() # Different in files

b'A\xc3\x84B\xc3\xa8C'

File input decoding

Similarly, to read arbitrary Unicode data, we simply pass in the file’s encoding type name to open, and it decodes from raw bytes to strings automatically; we could read raw bytes and decode manually too, but that can be tricky when reading in blocks (we might read an incomplete character), and it isn’t necessary:

# Decoding automatically when read

>>> open('latindata', 'r', encoding='latin-1').read() # Decoded on input

'AÄBèC'

>>> open('utf8data', 'r', encoding='utf-8').read() # Per encoding type

'AÄBèC'

>>> X = open('latindata', 'rb').read() # Manual decoding:

>>> X.decode('latin-1') # Not necessary

'AÄBèC'

>>> X = open('utf8data', 'rb').read()

>>> X.decode() # UTF-8 is default

'AÄBèC'
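To see why manual decoding “can be tricky when reading in blocks,” consider a UTF-8 stream read in fixed-size chunks—a chunk boundary can fall in the middle of a multibyte character. The standard library’s incremental decoders buffer the partial sequence between calls; the following sketch shows one way to handle it by hand, though the automatic file interface above makes this unnecessary:

```python
import codecs

raw = 'AÄBèC'.encode('utf-8')        # 7 bytes for 5 characters
dec = codecs.getincrementaldecoder('utf-8')()
text = ''
for i in range(0, len(raw), 2):      # 2-byte chunks split the accented chars
    text += dec.decode(raw[i:i+2])   # Partial sequences are buffered
text += dec.decode(b'', final=True)  # Flush any trailing state
assert text == 'AÄBèC'
```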

Decoding mismatches

Finally, keep in mind that this behavior of files in 3.X limits the kind of content you can load as text. As suggested in the prior section, Python 3.X really must be able to decode the data in text files into a str string, according to either the default or a passed-in Unicode encoding name. Trying to open a truly binary data file in text mode, for example, is unlikely to work in 3.X even if you use the correct object types:

>>> file = open(r'C:\Python33\python.exe', 'r')

>>> text = file.read()

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2: ...

>>> file = open(r'C:\Python33\python.exe', 'rb')

>>> data = file.read()

>>> data[:20]

b'MZ\x90\x00\x03\x00\x00\x00\x04\x00\x00\x00\xff\xff\x00\x00\xb8\x00\x00\x00'

The first of these examples might not fail in Python 2.X (normal files do not decode text), even though it probably should: reading the file may return corrupted data in the string, due to automatic end-of-line translations in text mode (any embedded \r\n bytes will be translated to \n on Windows when read). To treat file content as Unicode text in 2.X, we need to use special tools instead of the general open built-in function, as we’ll see in a moment. First, though, let’s turn to a more explosive topic.

Handling the BOM in 3.X

As described earlier in this chapter, some encoding schemes store a special byte order marker (BOM) sequence at the start of files, to specify data endianness (which end of a string of bits is most significant to its value) or declare the encoding type. Python both skips this marker on input and writes it on output if the encoding name implies it, but we sometimes must use a specific encoding name to force BOM processing explicitly.

For example, in the UTF-16 and UTF-32 encodings, the BOM specifies big- or little-endian format. A UTF-8 text file may also include a BOM, but this isn’t guaranteed, and serves only to declare that it is UTF-8 in general. When reading and writing data using these encoding schemes, Python automatically skips or writes the BOM if it is either implied by a general encoding name, or if you provide a more specific encoding name to force the issue. For instance:

§ In UTF-16, the BOM is always processed for “utf-16,” and the more specific encoding name “utf-16-le” denotes little-endian format.

§ In UTF-8, the more specific encoding “utf-8-sig” forces Python to both skip and write a BOM on input and output, respectively, but the general “utf-8” does not.
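The codecs module exposes these BOM byte sequences as constants, so you can also sniff a file’s leading bytes yourself when its encoding is unknown. The helper below is a simplified sketch of my own (it ignores UTF-32, for instance), not a standard library function:

```python
import codecs

def sniff_bom(raw):
    """Guess (encoding_name, bom_length) from a file's leading bytes."""
    for bom, name in ((codecs.BOM_UTF8, 'utf-8-sig'),
                      (codecs.BOM_UTF16_LE, 'utf-16'),
                      (codecs.BOM_UTF16_BE, 'utf-16')):
        if raw.startswith(bom):
            return name, len(bom)
    return 'utf-8', 0                  # Assume BOM-less UTF-8 otherwise

assert sniff_bom(codecs.BOM_UTF8 + b'spam') == ('utf-8-sig', 3)
assert sniff_bom(b'spam') == ('utf-8', 0)
```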

Dropping the BOM in Notepad

Let’s make some files with BOMs to see how this works in practice. When you save a text file in Windows Notepad, you can specify its encoding type in a drop-down list—simple ASCII text, UTF-8, or little- or big-endian UTF-16. If a two-line text file named spam.txt is saved in Notepad as the encoding type ANSI, for instance, it’s written as simple ASCII text without a BOM. When this file is read in binary mode in Python, we can see the actual bytes stored in the file. When it’s read as text, Python performs end-of-line translation by default; we can also decode it as explicit UTF-8 text since ASCII is a subset of this scheme (and UTF-8 is Python 3.X’s default encoding):

C:\code> C:\python33\python # File saved in Notepad

>>> import sys

>>> sys.getdefaultencoding()

'utf-8'

>>> open('spam.txt', 'rb').read() # ASCII (UTF-8) text file

b'spam\r\nSPAM\r\n'

>>> open('spam.txt', 'r').read() # Text mode translates line end

'spam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-8').read()

'spam\nSPAM\n'

If this file is instead saved as UTF-8 in Notepad, it is prepended with a 3-byte UTF-8 BOM sequence, and we need to give a more specific encoding name (“utf-8-sig”) to force Python to skip the marker:

>>> open('spam.txt', 'rb').read() # UTF-8 with 3-byte BOM

b'\xef\xbb\xbfspam\r\nSPAM\r\n'

>>> open('spam.txt', 'r').read()

'spam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-8').read()

'\ufeffspam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-8-sig').read()

'spam\nSPAM\n'

If the file is stored as Unicode big endian in Notepad, we get UTF-16-format data in the file, with 2-byte (16-bit) characters prepended with a 2-byte BOM sequence—the encoding name “utf-16” in Python skips the BOM because it is implied (since all UTF-16 files have a BOM), and “utf-16-be” handles the big-endian format but does not skip the BOM (the second of the following fails to print on older Pythons):

>>> open('spam.txt', 'rb').read()

b'\xfe\xff\x00s\x00p\x00a\x00m\x00\r\x00\n\x00S\x00P\x00A\x00M\x00\r\x00\n'

>>> open('spam.txt', 'r').read()

'\xfeÿ\x00s\x00p\x00a\x00m\x00\n\x00\n\x00S\x00P\x00A\x00M\x00\n\x00\n'

>>> open('spam.txt', 'r', encoding='utf-16').read()

'spam\nSPAM\n'

>>> open('spam.txt', 'r', encoding='utf-16-be').read()

'\ufeffspam\nSPAM\n'

Notepad’s “Unicode,” by the way, is UTF-16 little endian (which, of course, is one of very many kinds of Unicode encoding!).

Dropping the BOM in Python

The same patterns generally hold true for output. When writing a Unicode file in Python code, we need a more explicit encoding name to force the BOM in UTF-8—“utf-8” does not write (or skip) the BOM, but “utf-8-sig” does:

>>> open('temp.txt', 'w', encoding='utf-8').write('spam\nSPAM\n')

10

>>> open('temp.txt', 'rb').read() # No BOM

b'spam\r\nSPAM\r\n'

>>> open('temp.txt', 'w', encoding='utf-8-sig').write('spam\nSPAM\n')

10

>>> open('temp.txt', 'rb').read() # Wrote BOM

b'\xef\xbb\xbfspam\r\nSPAM\r\n'

>>> open('temp.txt', 'r').read()

'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8').read() # Keeps BOM

'\ufeffspam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8-sig').read() # Skips BOM

'spam\nSPAM\n'

Notice that although “utf-8” does not drop the BOM, data without a BOM can be read with both “utf-8” and “utf-8-sig”—use the latter for input if you’re not sure whether a BOM is present in a file (and don’t read this paragraph out loud in an airport security line!):

>>> open('temp.txt', 'w').write('spam\nSPAM\n')

10

>>> open('temp.txt', 'rb').read() # Data without BOM

b'spam\r\nSPAM\r\n'

>>> open('temp.txt', 'r').read() # Either utf-8 works

'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8').read()

'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-8-sig').read()

'spam\nSPAM\n'

Finally, for the encoding name “utf-16,” the BOM is handled automatically: on output, data is written in the platform’s native endianness, and the BOM is always written; on input, data is decoded per the BOM, and the BOM is always stripped because it’s standard in this scheme:

>>> sys.byteorder

'little'

>>> open('temp.txt', 'w', encoding='utf-16').write('spam\nSPAM\n')

10

>>> open('temp.txt', 'rb').read()

b'\xff\xfes\x00p\x00a\x00m\x00\r\x00\n\x00S\x00P\x00A\x00M\x00\r\x00\n\x00'

>>> open('temp.txt', 'r', encoding='utf-16').read()

'spam\nSPAM\n'

More specific UTF-16 encoding names can specify different endianness, though you may have to manually write and skip the BOM yourself in some scenarios if it is required or present—study the following examples for more BOM-making instructions:

>>> open('temp.txt', 'w', encoding='utf-16-be').write('\ufeffspam\nSPAM\n')

11

>>> open('temp.txt', 'rb').read()

b'\xfe\xff\x00s\x00p\x00a\x00m\x00\r\x00\n\x00S\x00P\x00A\x00M\x00\r\x00\n'

>>> open('temp.txt', 'r', encoding='utf-16').read()

'spam\nSPAM\n'

>>> open('temp.txt', 'r', encoding='utf-16-be').read()

'\ufeffspam\nSPAM\n'

The more specific UTF-16 encoding names work fine with BOM-less files, though “utf-16” requires one on input in order to determine byte order:

>>> open('temp.txt', 'w', encoding='utf-16-le').write('SPAM')

4

>>> open('temp.txt', 'rb').read() # OK if BOM not present or expected

b'S\x00P\x00A\x00M\x00'

>>> open('temp.txt', 'r', encoding='utf-16-le').read()

'SPAM'

>>> open('temp.txt', 'r', encoding='utf-16').read()

UnicodeError: UTF-16 stream does not start with BOM

Experiment with these encodings yourself or see Python’s library manuals for more details on the BOM.

Unicode Files in 2.X

The preceding discussion applies to Python 3.X’s string types and files. You can achieve similar effects for Unicode files in 2.X, but the interface is different. However, if you replace str with unicode and open with codecs.open, the result is essentially the same as in 3.X:

C:\code> C:\python27\python

>>> S = u'A\xc4B\xe8C' # 2.X type

>>> print S

AÄBèC

>>> len(S)

5

>>> S.encode('latin-1') # Manual calls

'A\xc4B\xe8C'

>>> S.encode('utf-8')

'A\xc3\x84B\xc3\xa8C'

>>> import codecs # 2.X files

>>> codecs.open('latindata', 'w', encoding='latin-1').write(S) # Writes encode

>>> codecs.open('utfdata', 'w', encoding='utf-8').write(S)

>>> open('latindata', 'rb').read()

'A\xc4B\xe8C'

>>> open('utfdata', 'rb').read()

'A\xc3\x84B\xc3\xa8C'

>>> codecs.open('latindata', 'r', encoding='latin-1').read() # Reads decode

u'A\xc4B\xe8C'

>>> codecs.open('utfdata', 'r', encoding='utf-8').read()

u'A\xc4B\xe8C'

>>> print codecs.open('utfdata', 'r', encoding='utf-8').read() # Print to view

AÄBèC

For more 2.X Unicode details, see earlier sections of this chapter and Python 2.X manuals.

Unicode Filenames and Streams

In closing, this section has focused on the encoding and decoding of Unicode text file content, but Python also supports the notion of non-ASCII file names. In fact, file content and filenames are governed by independent encoding settings in sys, which can vary per Python version and platform (2.X returns ASCII for the first of the following on Windows):

>>> import sys

>>> sys.getdefaultencoding(), sys.getfilesystemencoding() # File content, names

('utf-8', 'mbcs')

Filenames: Text versus bytes

Filename encoding is often a nonissue. In short, for filenames given as Unicode text strings, the open call encodes automatically to and from the underlying platform’s filename conventions. Passing arbitrarily pre-encoded filenames as byte strings to file tools (including open and directory walkers and listers) overrides automatic encodings, and forces filename results to be returned in encoded byte string form too—useful if filenames are undecodable per the underlying platform’s conventions (I’m using Windows, but some of the following may fail on other platforms):

>>> f = open('xxx\u00A5', 'w') # Non-ASCII filename

>>> f.write('\xA5999\n') # Writes five characters

>>> f.close()

>>> print(open('xxx\u00A5').read()) # Text: auto-encoded

¥999

>>> print(open(b'xxx\xA5').read()) # Bytes: pre-encoded

¥999

>>> import glob # Filename expansion tool

>>> glob.glob('*\u00A5*') # Get decoded text for decoded text

['xxx¥']

>>> glob.glob(b'*\xA5*') # Get encoded bytes for encoded bytes

[b'xxx\xa5']
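Other filename tools follow the same convention as glob here. For instance, os.listdir returns decoded str names for a str directory argument and encoded bytes names for a bytes one—a portable sketch using a temporary scratch directory:

```python
import os, tempfile

d = tempfile.mkdtemp()                       # Scratch directory for the demo
open(os.path.join(d, 'spam.txt'), 'w').close()
assert os.listdir(d) == ['spam.txt']         # str in -> str names out
assert os.listdir(os.fsencode(d)) == [b'spam.txt']   # bytes in -> bytes out
```

os.fsencode (3.2 and later) applies the platform’s filesystem encoding, so the bytes form matches what the bytes-mode APIs return.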

Stream content: PYTHONIOENCODING

In addition, the environment variable PYTHONIOENCODING can be used to set the encoding used for text in the standard streams—input, output, and error. This setting overrides Python’s default encoding for printed text, which on Windows currently uses a Windows format on 3.X and ASCII on 2.X. Setting this to a general Unicode format like UTF-8 may sometimes be required to print non-ASCII text, and to display such text in shell windows (possibly in conjunction with code page changes on some Windows machines). A script that prints non-ASCII filenames, for example, may fail unless this setting is made.

For more background on this subject, see also “Currency Symbols: Unicode in Action” in Chapter 25. There, we work through an example that demonstrates the essentials of portable Unicode coding, as well as the roles and requirements of PYTHONIOENCODING settings, which we won’t rehash here.

For more on these topics in general, see Python manuals or books such as Programming Python, 4th Edition (or later, if later may be). The latter of these digs deeper into streams and files from an applications-level perspective.

Other String Tool Changes in 3.X

Many of the other popular string-processing tools in Python’s standard library have also been revamped for the new str/bytes type dichotomy. We won’t cover any of these application-focused tools in much detail in this core language book, but to wrap up this chapter, here’s a quick look at four of the major tools impacted: the re pattern-matching module, the struct binary data module, the pickle object serialization module, and the xml package for parsing XML text. As noted ahead, other Python tools, such as its json module, differ in ways similar to those presented here.

The re Pattern-Matching Module

Python’s re pattern-matching module supports text processing that is more general than that afforded by simple string method calls such as find, split, and replace. With re, strings that designate searching and splitting targets can be described by general patterns, instead of absolute text. This module has been generalized to work on objects of any string type in 3.X—str, bytes, and bytearray—and returns result substrings of the same type as the subject string. In 2.X it supports both unicode and str.

Here it is at work in 3.X, extracting substrings from a line of text—borrowed, of course, from Monty Python’s The Meaning of Life. Within pattern strings, (.*) means any character (the .), zero or more times (the *), saved away as a matched substring (the ()). Parts of the string matched by the parts of a pattern enclosed in parentheses are available after a successful match, via the group or groups method:

C:\code> C:\python33\python

>>> import re

>>> S = 'Bugger all down here on earth!' # Line of text

>>> B = b'Bugger all down here on earth!' # Usually from a file

>>> re.match('(.*) down (.*) on (.*)', S).groups() # Match line to pattern

('Bugger all', 'here', 'earth!') # Matched substrings

>>> re.match(b'(.*) down (.*) on (.*)', B).groups() # bytes substrings

(b'Bugger all', b'here', b'earth!')

In Python 2.X results are similar, but the unicode type is used for non-ASCII text, and str handles both 8-bit and binary text:

C:\code> C:\python27\python

>>> import re

>>> S = 'Bugger all down here on earth!' # Simple text and binary

>>> U = u'Bugger all down here on earth!' # Unicode text

>>> re.match('(.*) down (.*) on (.*)', S).groups()

('Bugger all', 'here', 'earth!')

>>> re.match('(.*) down (.*) on (.*)', U).groups()

(u'Bugger all', u'here', u'earth!')

Since bytes and str support essentially the same operation sets, this type distinction is largely transparent. But note that, as in other APIs, you can’t mix str and bytes types in its calls’ arguments in 3.X (although if you don’t plan to do pattern matching on binary data, you probably don’t need to care):

C:\code> C:\python33\python

>>> import re

>>> S = 'Bugger all down here on earth!'

>>> B = b'Bugger all down here on earth!'

>>> re.match('(.*) down (.*) on (.*)', B).groups()

TypeError: can't use a string pattern on a bytes-like object

>>> re.match(b'(.*) down (.*) on (.*)', S).groups()

TypeError: can't use a bytes pattern on a string-like object

>>> re.match(b'(.*) down (.*) on (.*)', bytearray(B)).groups()

(bytearray(b'Bugger all'), bytearray(b'here'), bytearray(b'earth!'))

>>> re.match('(.*) down (.*) on (.*)', bytearray(B)).groups()

TypeError: can't use a string pattern on a bytes-like object
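If you need one pattern source to serve both subject types, you must encode it yourself before matching binary data. The following helper is purely illustrative—it assumes an ASCII-only pattern and is not part of the re API:

```python
import re

def match_any(pattern, subject):
    """Match a str pattern against either text or binary subjects."""
    if isinstance(subject, (bytes, bytearray)):
        pattern = pattern.encode('ascii')    # Assumes ASCII-only pattern text
    return re.match(pattern, subject)

assert match_any('(.*) on (.*)', 'here on earth').groups() == ('here', 'earth')
assert match_any('(.*) on (.*)', b'here on earth').groups() == (b'here', b'earth')
```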

The struct Binary Data Module

The Python struct module, used to create and extract packed binary data from strings, also works the same in 3.X as it does in 2.X, but in 3.X packed data is represented as bytes and bytearray objects only, not str objects (which makes sense, given that it’s intended for processing binary data, not decoded text); and “s” data code values must be bytes as of 3.2 (the former str UTF-8 auto-encode is dropped).

Here are both Pythons in action, packing three objects into a string according to a binary type specification (they create a 4-byte integer, a 4-byte string, and a 2-byte integer):

C:\code> C:\python33\python

>>> from struct import pack

>>> pack('>i4sh', 7, b'spam', 8) # bytes in 3.X (8-bit strings)

b'\x00\x00\x00\x07spam\x00\x08'

C:\code> C:\python27\python

>>> from struct import pack

>>> pack('>i4sh', 7, 'spam', 8) # str in 2.X (8-bit strings)

'\x00\x00\x00\x07spam\x00\x08'

Since bytes has an almost identical interface to that of str in 3.X and 2.X, though, most programmers probably won’t need to care—the change is irrelevant to most existing code, especially since reading from a binary file creates a bytes automatically. Although the last test in the following example fails on a type mismatch, most scripts will read binary data from a file, not create it as a string as we do here:

C:\code> C:\python33\python

>>> import struct

>>> B = struct.pack('>i4sh', 7, b'spam', 8)

>>> B

b'\x00\x00\x00\x07spam\x00\x08'

>>> vals = struct.unpack('>i4sh', B)

>>> vals

(7, b'spam', 8)

>>> vals = struct.unpack('>i4sh', B.decode())

TypeError: 'str' does not support the buffer interface
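When stepping through packed records like this, struct.calcsize reports how many bytes a format string occupies—useful for sizing reads from a binary file. A quick sketch with the same format used above:

```python
import struct

fmt = '>i4sh'                          # 4-byte int + 4-byte string + 2-byte int
size = struct.calcsize(fmt)            # Packed size of the format: 10 bytes
rec = struct.pack(fmt, 7, b'spam', 8)
assert size == len(rec) == 10
assert struct.unpack(fmt, rec) == (7, b'spam', 8)
```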

Apart from the new syntax for bytes, creating and reading binary files works almost the same in 3.X as it does in 2.X. Still, code like this is one of the main places where programmers will notice the bytes object type:

C:\code> C:\python33\python

# Write values to a packed binary file

>>> F = open('data.bin', 'wb') # Open binary output file

>>> import struct

>>> data = struct.pack('>i4sh', 7, b'spam', 8) # Create packed binary data

>>> data # bytes in 3.X, not str

b'\x00\x00\x00\x07spam\x00\x08'

>>> F.write(data) # Write to the file

10

>>> F.close()

# Read values from a packed binary file

>>> F = open('data.bin', 'rb') # Open binary input file

>>> data = F.read() # Read bytes

>>> data

b'\x00\x00\x00\x07spam\x00\x08'

>>> values = struct.unpack('>i4sh', data) # Extract packed binary data

>>> values # Back to Python objects

(7, b'spam', 8)

Once you’ve extracted packed binary data into Python objects like this, you can dig even further into the binary world if you have to—strings can be indexed and sliced to get individual bytes’ values, individual bits can be extracted from integers with bitwise operators, and so on (see earlier in this book for more on the operations applied here):

>>> values # Result of struct.unpack

(7, b'spam', 8)

# Accessing bits of parsed integers

>>> bin(values[0]) # Can get to bits in ints

'0b111'

>>> values[0] & 0x01 # Test first (lowest) bit in int

1

>>> values[0] | 0b1010 # Bitwise or: turn bits on

15

>>> bin(values[0] | 0b1010) # 15 decimal is 1111 binary

'0b1111'

>>> bin(values[0] ^ 0b1010) # Bitwise xor: off if both true

'0b1101'

>>> bool(values[0] & 0b100) # Test if bit 3 is on

True

>>> bool(values[0] & 0b1000) # Test if bit 4 is set

False

Since parsed bytes strings are sequences of small integers, we can do similar processing with their individual bytes:

# Accessing bytes of parsed strings and bits within them

>>> values[1]

b'spam'

>>> values[1][0] # bytes string: sequence of ints

115

>>> values[1][1:] # Prints as ASCII characters

b'pam'

>>> bin(values[1][0]) # Can get to bits of bytes in strings

'0b1110011'

>>> bin(values[1][0] | 0b1100) # Turn bits on

'0b1111111'

>>> values[1][0] | 0b1100

127

Of course, most Python programmers don’t deal with binary bits; Python has higher-level object types, like lists and dictionaries, that are generally a better choice for representing information in Python scripts. However, if you must use or produce lower-level data used by C programs, networking libraries, or other interfaces, Python has tools to assist.
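If you pack and unpack the same layout repeatedly, the format can be precompiled with the standard library’s struct.Struct class; it gives the same results as the module-level calls shown above, without reparsing the format string on each call. A short sketch, reusing this chapter’s '>i4sh' layout:

```python
import struct

# Precompile the chapter's '>i4sh' layout: big-endian int,
# 4-byte string, and short -- reused without reparsing
record = struct.Struct('>i4sh')

packed = record.pack(7, b'spam', 8)     # Same result as struct.pack
print(packed)                           # b'\x00\x00\x00\x07spam\x00\x08'
print(record.size)                      # Total packed size in bytes: 10

values = record.unpack(packed)          # Back to Python objects
print(values)                           # (7, b'spam', 8)
```

The precompiled object is handy when a record layout is fixed and processed many times, as in a loop over a binary file’s records.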

The pickle Object Serialization Module

We met the pickle module briefly in Chapter 9, Chapter 28, and Chapter 31. In Chapter 28, we also used the shelve module, which uses pickle internally. For completeness here, keep in mind that the Python 3.X version of the pickle module always creates a bytes object, regardless of the default or passed-in “protocol” (data format level). You can see this by using the module’s dumps call to return an object’s pickle string:

C:\code> C:\python33\python

>>> import pickle # dumps() returns pickle string

>>> pickle.dumps([1, 2, 3]) # Python 3.X default protocol=3=binary

b'\x80\x03]q\x00(K\x01K\x02K\x03e.'

>>> pickle.dumps([1, 2, 3], protocol=0) # ASCII protocol 0, but still bytes!

b'(lp0\nL1L\naL2L\naL3L\na.'

This implies that files used to store pickled objects must always be opened in binary mode in Python 3.X, since text files use str strings to represent data, not bytes—the dump call simply attempts to write the pickle string to an open output file:

>>> pickle.dump([1, 2, 3], open('temp', 'w')) # Text files fail on bytes!

TypeError: must be str, not bytes # Despite protocol value

>>> pickle.dump([1, 2, 3], open('temp', 'w'), protocol=0)

TypeError: must be str, not bytes

>>> pickle.dump([1, 2, 3], open('temp', 'wb')) # Always use binary in 3.X

>>> open('temp', 'r').read() # This works, but just by luck

'\u20ac\x03]q\x00(K\x01K\x02K\x03e.'

Notice the last result here didn’t issue an error in text mode only because the stored binary data was compatible with the Windows platform’s UTF-8 default decoder; this was really just luck (and in fact, this command failed when printing in older Pythons, and may fail on other platforms). Because pickle data is not generally decodable Unicode text, the same rule holds on input—correct usage in 3.X requires always reading and writing pickle data in binary mode, whether you are unpickling it or not:

>>> pickle.dump([1, 2, 3], open('temp', 'wb'))

>>> pickle.load(open('temp', 'rb'))

[1, 2, 3]

>>> open('temp', 'rb').read()

b'\x80\x03]q\x00(K\x01K\x02K\x03e.'

In Python 2.X, we can get by with text-mode files for pickled data, as long as the protocol is level 0 (the default in 2.X) and we use text mode consistently to convert line ends:

C:\code> C:\python27\python

>>> import pickle

>>> pickle.dumps([1, 2, 3]) # Python 2.X default=0=ASCII

'(lp0\nI1\naI2\naI3\na.'

>>> pickle.dumps([1, 2, 3], protocol=1)

']q\x00(K\x01K\x02K\x03e.'

>>> pickle.dump([1, 2, 3], open('temp', 'w')) # Text mode works in 2.X

>>> pickle.load(open('temp'))

[1, 2, 3]

>>> open('temp').read()

'(lp0\nI1\naI2\naI3\na.'

If you care about version neutrality, though, or don’t want to care about protocols or their version-specific defaults, always use binary-mode files for pickled data—the following works the same in Python 3.X and 2.X:

>>> import pickle

>>> pickle.dump([1, 2, 3], open('temp', 'wb')) # Version neutral

>>> pickle.load(open('temp', 'rb')) # And required in 3.X

[1, 2, 3]

Because almost all programs let Python pickle and unpickle objects automatically and do not deal with the content of pickled data itself, the requirement to always use binary file modes is the only significant incompatibility in Python 3.X’s newer pickling model. See reference books or Python’s manuals for more details on object pickling.
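When no file is involved at all, the same round trip can be performed in memory with the module’s dumps and loads calls; this sketch works the same way in 3.X, where the pickle string is bytes, and 2.X, where it is str:

```python
import pickle

# In-memory pickling: dumps returns the pickle byte string,
# and loads reconstructs an equal (but distinct) object from it
obj = {'name': 'spam', 'counts': [1, 2, 3]}
data = pickle.dumps(obj)             # bytes in 3.X, regardless of protocol
clone = pickle.loads(data)           # Rebuild the object from its pickle

print(type(data))                    # <class 'bytes'> in 3.X
print(clone == obj, clone is obj)    # Equal value, different object
```

This is the form used under the hood when pickled objects are sent over sockets or stored in databases rather than flat files.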

XML Parsing Tools

XML is a tag-based language for defining structured information, commonly used to define documents and data shipped over the Web. Although some information can be extracted from XML text with basic string methods or the re pattern module, XML’s nesting of constructs and arbitrary attribute text tend to make full parsing more accurate.

Because XML is such a pervasive format, Python itself comes with an entire package of XML parsing tools that support the SAX and DOM parsing models, as well as a package known as ElementTree—a Python-specific API for parsing and constructing XML. Beyond basic parsing, the open source domain provides support for additional XML tools, such as XPath, XQuery, XSLT, and more.

XML by definition represents text in Unicode form, to support internationalization. Although most of Python’s XML parsing tools have always returned Unicode strings, in Python 3.X their results have mutated from the 2.X unicode type to the 3.X general str string type—which makes sense, given that 3.X’s str string is Unicode, whether the encoding is ASCII or other.

We can’t go into many details here, but to sample the flavor of this domain, suppose we have a simple XML text file, mybooks.xml:

<books>

<date>1995~2013</date>

<title>Learning Python</title>

<title>Programming Python</title>

<title>Python Pocket Reference</title>

<publisher>O'Reilly Media</publisher>

</books>

and we want to run a script to extract and display the content of all the nested title tags, as follows:

Learning Python

Programming Python

Python Pocket Reference

There are at least four basic ways to accomplish this (not counting more advanced tools like XPath). First, we could run basic pattern matching on the file’s text, though this tends to be inaccurate if the text is unpredictable. Where applicable, the re module we met earlier does the job—its match method looks for a match at the start of a string, search scans ahead for a match, and the findall method used here locates all places where the pattern matches in the string (the result comes back as a list of matched substrings corresponding to parenthesized pattern groups, or tuples of such for multiple groups):

# File patternparse.py

import re

text = open('mybooks.xml').read()

found = re.findall('<title>(.*)</title>', text)

for title in found: print(title)
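Incidentally, re in 3.X works on bytes as well as str, as long as the pattern and subject agree in type. A hypothetical variant that scans undecoded content directly, as it might be read from a binary-mode file:

```python
import re

# Match on bytes instead of str: a bytes pattern against bytes data;
# in 3.X the pattern and subject types must agree
text = b'<title>Learning Python</title>\n<title>Programming Python</title>'
found = re.findall(b'<title>(.*)</title>', text)
print(found)        # [b'Learning Python', b'Programming Python']
```

Mixing a str pattern with bytes data (or vice versa) raises a TypeError in 3.X, which is one more place the str/bytes distinction surfaces in library tools.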

Second, to be more robust, we could perform complete XML parsing with the standard library’s DOM parsing support. DOM parses XML text into a tree of objects and provides an interface for navigating the tree to extract tag attributes and values; the interface is a formal specification, independent of Python:

# File domparse.py

from xml.dom.minidom import parse, Node

xmltree = parse('mybooks.xml')

for node1 in xmltree.getElementsByTagName('title'):

for node2 in node1.childNodes:

if node2.nodeType == Node.TEXT_NODE:

print(node2.data)

As a third option, Python’s standard library supports SAX parsing for XML. Under the SAX model, a class’s methods receive callbacks as a parse progresses and use state information to keep track of where they are in the document and collect its data:

# File saxparse.py

import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):

def __init__(self):

self.inTitle = False

def startElement(self, name, attributes):

if name == 'title':

self.inTitle = True

def characters(self, data):

if self.inTitle:

print(data)

def endElement(self, name):

if name == 'title':

self.inTitle = False

import xml.sax

parser = xml.sax.make_parser()

handler = BookHandler()

parser.setContentHandler(handler)

parser.parse('mybooks.xml')

Finally, the ElementTree system, available in the standard library’s xml.etree package, can often achieve the same effects as XML DOM parsers, but with remarkably less code. It’s a Python-specific way to both parse and generate XML text; after a parse, its API gives access to components of the document:

# File etreeparse.py

from xml.etree.ElementTree import parse

tree = parse('mybooks.xml')

for E in tree.findall('title'):

print(E.text)

When run in either 2.X or 3.X, all four of these scripts display the same printed result:

C:\code> C:\python27\python domparse.py

Learning Python

Programming Python

Python Pocket Reference

C:\code> C:\python33\python domparse.py

Learning Python

Programming Python

Python Pocket Reference

Technically, though, in 2.X some of these scripts produce unicode string objects, while in 3.X all produce str strings, since that type includes Unicode text (whether ASCII or other):

C:\code> C:\python33\python

>>> from xml.dom.minidom import parse, Node

>>> xmltree = parse('mybooks.xml')

>>> for node in xmltree.getElementsByTagName('title'):

for node2 in node.childNodes:

if node2.nodeType == Node.TEXT_NODE:

node2.data

'Learning Python'

'Programming Python'

'Python Pocket Reference'

C:\code> C:\python27\python

>>> ...same code...

u'Learning Python'

u'Programming Python'

u'Python Pocket Reference'

Programs that must deal with XML parsing results in nontrivial ways will need to account for the different object type in 3.X. Again, though, because all strings have nearly identical interfaces in both 2.X and 3.X, most scripts won’t be affected by the change; tools available on unicode in 2.X are generally available on str in 3.X. The main task, if there is one, is likely getting the encoding names right when transferring the parsed-out data to and from files, network connections, GUIs, and so on.
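Though we can’t explore it here in full, note that ElementTree also builds XML text, not just parses it. A minimal sketch that constructs a document similar to mybooks.xml; by default tostring returns bytes in 3.X, unless passed encoding='unicode':

```python
from xml.etree.ElementTree import Element, SubElement, tostring

# Build a small tree like mybooks.xml in memory
root = Element('books')
for name in ('Learning Python', 'Programming Python'):
    title = SubElement(root, 'title')   # Nested <title> element
    title.text = name                   # Its text content

xml = tostring(root, encoding='unicode')   # str here; omit encoding for bytes
print(xml)
```

The bytes-versus-str choice on output mirrors the parsing story: serialized XML destined for a binary-mode file or socket is naturally bytes, while str suits display and further text processing.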

Regrettably, going into further XML parsing details is beyond this book’s scope. If you are interested in text or XML parsing, it is covered in more detail in the applications-focused follow-up book Programming Python. For more details on re, struct, pickle, and XML, as well as the additional impacts of Unicode on other library tools such as filename expansion and directory walkers, consult the Web, the aforementioned book and others, and Python’s standard library manual.

For a related topic, see also the JSON example in Chapter 9—a language-neutral data exchange format, whose structure is very similar to Python dictionaries and lists, and whose strings are all Unicode that differs in type between Pythons 2.X and 3.X much the same as shown for XML here.

WHY YOU WILL CARE: INSPECTING FILES, AND MUCH MORE

As I was updating this chapter, I stumbled onto a use case for some of its tools. After saving a formerly ASCII HTML file in Notepad as “UTF8,” I found that it had grown a mystery non-ASCII character along the way due to an apparent keyboard operator error, and would no longer work as ASCII in text tools. To find the bad character, I simply started Python, decoded the file’s content from its UTF-8 format via a text mode file, and scanned character by character looking for the first byte that was not a valid ASCII character too:

>>> import sys

>>> f = open('py33-windows-launcher.html', encoding='utf8')

>>> t = f.read()

>>> for (i, c) in enumerate(t):

try:

x = c.encode(encoding='ascii')

except:

print(i, sys.exc_info()[0])

9886 <class 'UnicodeEncodeError'>

With the bad character’s index in hand, it’s easy to slice the Unicode string for more details:

>>> len(t)

31021

>>> t[9880:9890]

'ugh. \u206cThi'

>>> t[9870:9890]

'trace through. \u206cThi'

After fixing, I could also open in binary mode to verify and explore actual undecoded file content further:

>>> f = open('py33-windows-launcher.html', 'rb')

>>> b = f.read()

>>> b[0]

60

>>> b[:10]

b'<HTML>\r\n<T'

Not rocket science, perhaps, and there are other approaches, but Python makes for a convenient tactical tool in such cases, and its file objects give you a tangible window on your data when needed, both in scripts and interactive mode.
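The interactive scan above can also be packaged as a reusable tool. This hypothetical helper simply collects every character that fails an ASCII encode, along with its position in the text:

```python
def find_non_ascii(text):
    """Return (index, character) pairs for all non-ASCII characters."""
    bad = []
    for i, c in enumerate(text):
        try:
            c.encode('ascii')            # Succeeds only for 7-bit characters
        except UnicodeEncodeError:
            bad.append((i, c))
    return bad

print(find_non_ascii('trace through. \u206cThis'))   # [(15, '\u206c')]
print(find_non_ascii('plain ascii'))                 # []
```

Pointing this at the result of a text-mode read reproduces the chapter’s one-off scan in a single call.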

For more realistically scaled examples of Unicode at work, I suggest my other book Programming Python, 4th Edition (or later). That book develops much larger programs than we can here, and has numerous up close and personal encounters with Unicode along the way, in the context of files, directory walkers, network sockets, GUIs, email content and headers, web page content, databases, and more. Though clearly an important topic in today’s global software world, Unicode is harder to avoid than you might expect, especially in a language like Python 3.X, which elevates it to its core string and file types, thus bringing all its users into the Unicode fold—ready or not!

Chapter Summary

This chapter explored in-depth the advanced string types available in Python 3.X and 2.X for processing Unicode text and binary data. As we saw, many programmers use ASCII text and can get by with the basic string type and its operations. For more advanced applications, Python’s string models fully support both richer Unicode text (via the normal string type in 3.X and a special type in 2.X) and byte-oriented data (represented with a bytes type in 3.X and normal strings in 2.X).

In addition, we learned how Python’s file object has mutated in 3.X to automatically encode and decode Unicode text and deal with byte strings for binary-mode files, and saw similar utility for 2.X. Finally, we briefly met some text and binary data tools in Python’s library, and sampled their behavior in 3.X and 2.X.

In the next chapter, we’ll shift our focus to tool-builder topics, with a look at ways to manage access to object attributes by inserting automatically run code. Before we move on, though, here’s a set of questions to review what we’ve learned here. This has been a substantial chapter, so be sure to read the quiz answers eventually for a more in-depth summary.

Test Your Knowledge: Quiz

1. What are the names and roles of string object types in Python 3.X?

2. What are the names and roles of string object types in Python 2.X?

3. What is the mapping between 2.X and 3.X string types?

4. How do Python 3.X’s string types differ in terms of operations?

5. How can you code non-ASCII Unicode characters in a string in 3.X?

6. What are the main differences between text- and binary-mode files in Python 3.X?

7. How would you read a Unicode text file that contains text in a different encoding than the default for your platform?

8. How can you create a Unicode text file in a specific encoding format?

9. Why is ASCII text considered to be a kind of Unicode text?

10. How large an impact does Python 3.X’s string types change have on your code?

Test Your Knowledge: Answers

1. Python 3.X has three string types: str (for Unicode text, including ASCII), bytes (for binary data with absolute byte values), and bytearray (a mutable flavor of bytes). The str type usually represents content stored on a text file, and the other two types generally represent content stored on binary files.

2. Python 2.X has two main string types: str (for 8-bit text and binary data) and unicode (for possibly wider character Unicode text). The str type is used for both text and binary file content; unicode is used for text file content that is generally more complex than 8-bit characters. Python 2.6 (but not earlier) also has 3.X’s bytearray type, but it’s mostly a back-port and doesn’t exhibit the sharp text/binary distinction that it does in 3.X.

3. The mapping from 2.X to 3.X string types is not direct, because 2.X’s str equates to both str and bytes in 3.X, and 3.X’s str equates to both str and unicode in 2.X. The mutability of bytearray in 3.X is also unique. In general, though: Unicode text is handled by 3.X str and 2.X unicode, byte-based data is handled by 3.X bytes and 2.X str, and 3.X bytes and 2.X str can both handle some simpler types of text.

4. Python 3.X’s string types share almost all the same operations: method calls, sequence operations, and even larger tools like pattern matching work the same way. On the other hand, only str supports string formatting operations, and bytearray has an additional set of operations that perform in-place changes. The str and bytes types also have methods for encoding and decoding text, respectively.

5. Non-ASCII Unicode characters can be coded in a string with both hex (\xNN) and Unicode (\uNNNN, \UNNNNNNNN) escapes. On some machines, some non-ASCII characters—certain Latin-1 characters, for example—can also be typed or pasted directly into code, and are interpreted per the UTF-8 default or a source code encoding directive comment.

6. In 3.X, text-mode files assume their file content is Unicode text (even if it’s all ASCII) and automatically decode when reading and encode when writing. With binary-mode files, bytes are transferred to and from the file unchanged. The contents of text-mode files are usually represented as str objects in your script, and the contents of binary files are represented as bytes (or bytearray) objects. Text-mode files also handle the BOM for certain encoding types and automatically translate end-of-line sequences to and from the single \n character on input and output unless this is explicitly disabled; binary-mode files do not perform either of these steps. Python 2.X uses codecs.open for Unicode files, which encodes and decodes similarly; 2.X’s open only translates line ends in text mode.

7. To read files encoded in a different encoding than the default for your platform, simply pass the name of the file’s encoding to the open built-in in 3.X (codecs.open() in 2.X); data will be decoded per the specified encoding when it is read from the file. You can also read in binary mode and manually decode the bytes to a string by giving an encoding name, but this involves extra work and is somewhat error-prone for multibyte characters (you may accidentally read a partial character sequence).

8. To create a Unicode text file in a specific encoding format, pass the desired encoding name to open in 3.X (codecs.open() in 2.X); strings will be encoded per the desired encoding when they are written to the file. You can also manually encode a string to bytes and write it in binary mode, but this is usually extra work.

9. ASCII text is considered to be a kind of Unicode text, because its 7-bit range of values is a subset of most Unicode encodings. For example, valid ASCII text is also valid Latin-1 text (Latin-1 simply assigns the remaining possible values in an 8-bit byte to additional characters) and valid UTF-8 text (UTF-8 defines a variable-byte scheme for representing more characters, but ASCII characters are still represented with the same codes, in a single byte). This makes Unicode backward-compatible with the mass of ASCII text data in the world, though it also may have limited its options: self-identifying text, for instance, may have been difficult, though BOMs serve much the same role.

10. The impact of Python 3.X’s string types change depends upon the types of strings you use. For scripts that use simple ASCII text on platforms with ASCII-compatible default encodings, the impact is probably minor: the str string type works the same in 2.X and 3.X in this case. Moreover, although string-related tools in the standard library such as re, struct, pickle, and xml may technically use different types in 3.X than in 2.X, the changes are largely irrelevant to most programs because 3.X’s str and bytes and 2.X’s str support almost identical interfaces. If you process Unicode data, the toolset you need has simply moved from 2.X’s unicode and codecs.open() to 3.X’s str and open. If you deal with binary data files, you’ll need to deal with content as bytes objects; since they have a similar interface to 2.X strings, though, the impact should again be minimal. That said, the update of the book Programming Python for 3.X ran across numerous cases where Unicode’s mandatory status in 3.X implied changes in standard library APIs—from networking and GUIs, to databases and email. In general, Unicode will probably impact most 3.X users eventually.
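Answers 5, 7, and 8 can be made concrete with a short 3.X sketch; the filename here is arbitrary, and Latin-1 is chosen just to keep the stored byte values simple:

```python
# Answer 5: three escape forms that denote the same non-ASCII character
s = '\xc4'                               # Hex escape (2 hex digits)
print(s == '\u00c4' == '\U000000c4')     # Unicode escapes (4 and 8 digits): True

# Answers 7 and 8: write and read a file in an explicit encoding;
# the encoding name must match on both ends
with open('unidata.txt', 'w', encoding='latin-1') as f:
    f.write('sp' + s + 'm')              # Encoded to Latin-1 on output

with open('unidata.txt', encoding='latin-1') as f:
    loaded = f.read()                    # Decoded back to str on input

with open('unidata.txt', 'rb') as f:
    raw = f.read()                       # The actual stored bytes

print(loaded)                            # The decoded str
print(raw)                               # b'sp\xc4m': one byte per character
```

Opening the same file without the encoding argument would use the platform default instead, which is exactly the mismatch answer 7 warns about.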