Working with Data - Writing Idiomatic Python (2013)

5. Working with Data

5.1 Variables

5.1.1 Avoid using a temporary variable when performing a swap of two values

There is no reason to swap using a temporary variable in Python. Tuple packing and unpacking swap the two values in a single statement and make our intention clearer.

5.1.1.1 Harmful

foo = 'Foo'

bar = 'Bar'

temp = foo

foo = bar

bar = temp

5.1.1.2 Idiomatic

foo = 'Foo'

bar = 'Bar'

(foo, bar) = (bar, foo)

5.2 Strings

5.2.1 Chain string functions to make a simple series of transformations more clear

When applying a simple sequence of transformations on some datum, chaining the calls in a single expression is often more clear than creating a temporary variable for each step of the transformation. Too much chaining, however, can make your code harder to follow. “No more than three chained functions” is a good rule of thumb.

5.2.1.1 Harmful

book_info = ' The Three Musketeers: Alexandre Dumas'

formatted_book_info = book_info.strip()

formatted_book_info = formatted_book_info.upper()

formatted_book_info = formatted_book_info.replace(':', ' by')

5.2.1.2 Idiomatic

book_info = ' The Three Musketeers: Alexandre Dumas'

formatted_book_info = book_info.strip().upper().replace(':', ' by')

5.2.2 Use ''.join when creating a single string for list elements

It’s faster, uses less memory, and you’ll see it everywhere anyway. Note that the two quotes represent the delimiter between list elements in the string we’re creating. '' just means we wish to concatenate the elements with no characters between them.

5.2.2.1 Harmful

result_list = ['True', 'False', 'File not found']
result_string = ''
for result in result_list:
    result_string += result

5.2.2.2 Idiomatic

result_list = ['True', 'False', 'File not found']

result_string = ''.join(result_list)
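The delimiter does not have to be empty. As a minimal sketch (the delimiters and the list of numbers are chosen here purely for illustration), the string that join is called on ends up between the elements:

```python
result_list = ['True', 'False', 'File not found']

# The string that join is called on is placed between the elements.
comma_separated = ', '.join(result_list)
one_per_line = '\n'.join(result_list)

# join only accepts strings, so convert other types first.
numbers_as_text = '-'.join(str(number) for number in [1, 2, 3])
```

Note that join raises a TypeError if any element is not a string, which is why the last line converts each number with str first.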

5.2.3 Prefer the format function for formatting strings

There are three general ways of formatting strings (that is, creating a string that is a mix of hard-coded strings and string variables). Easily the worst approach is to use the + operator to concatenate a mix of static strings and variables. Using “old-style” string formatting is slightly better. It makes use of a format string and the % operator to fill in values, much like printf does in other languages.

The clearest and most idiomatic way to format strings is to use the format function. Like old-style formatting, it takes a format string and replaces placeholders with values. The similarities end there, though. With the format function, we can use named placeholders, access their attributes, and control padding and string width, among a number of other things. The format function makes string formatting clean and concise.

5.2.3.1 Harmful

def get_formatted_user_info_worst(user):
    # Tedious to type and prone to conversion errors
    return 'Name: ' + user.name + ', Age: ' + \
        str(user.age) + ', Sex: ' + user.sex

def get_formatted_user_info_slightly_better(user):
    # No visible connection between the format string placeholders
    # and values to use. Also, why do I have to know the type?
    # Don't these types all have __str__ functions?
    return 'Name: %s, Age: %i, Sex: %c' % (
        user.name, user.age, user.sex)

5.2.3.2 Idiomatic

def get_formatted_user_info(user):
    # Clear and concise. At a glance I can tell exactly what
    # the output should be. Note: this string could be returned
    # directly, but the string itself is too long to fit on the
    # page. The parentheses let the two adjacent string literals
    # be joined into one before format is called.
    output = ('Name: {user.name}, Age: {user.age}, '
              'Sex: {user.sex}'.format(user=user))
    return output
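The other features mentioned above (named placeholders, attribute access, padding and width) can be sketched as follows. The User namedtuple is a hypothetical stand-in for whatever user object your code has, and the column widths are arbitrary:

```python
from collections import namedtuple

# Hypothetical stand-in for the user object used above.
User = namedtuple('User', ['name', 'age', 'sex'])
user = User(name='Ann', age=30, sex='F')

# Named placeholder with attribute access.
summary = 'Name: {user.name}, Age: {user.age}'.format(user=user)

# Padding and width: left-align the name in 10 columns,
# right-align the age in 5 columns.
row = '{name:<10}|{age:>5}'.format(name=user.name, age=user.age)
```

Here summary is 'Name: Ann, Age: 30', and row pads the name to ten characters and the age to five, which is handy for printing aligned tables.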

5.3 Lists

5.3.1 Use a list comprehension to create a transformed version of an existing list

List comprehensions, when used judiciously, increase clarity in code that builds a list from existing data. This is especially true when elements are both checked for some condition and transformed in some way.

There are also (usually) performance benefits to using a list comprehension (or alternatively, a generator expression) due to optimizations in the CPython interpreter.

5.3.1.1 Harmful

some_other_list = range(10)
some_list = list()
for element in some_other_list:
    if is_prime(element):
        some_list.append(element + 5)

5.3.1.2 Idiomatic

some_other_list = range(10)
some_list = [element + 5
             for element in some_other_list
             if is_prime(element)]
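The snippets above leave is_prime undefined. To run the comprehension end to end, one possible sketch is shown here; the trial-division is_prime is an assumption made for illustration, not something the text prescribes:

```python
def is_prime(number):
    """Naive primality test by trial division (illustrative only)."""
    if number < 2:
        return False
    return all(number % divisor != 0
               for divisor in range(2, int(number ** 0.5) + 1))

some_other_list = range(10)

# Filter (is_prime) and transform (+ 5) in a single expression.
some_list = [element + 5
             for element in some_other_list
             if is_prime(element)]
```

The primes below 10 are 2, 3, 5 and 7, so some_list ends up as [7, 8, 10, 12].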

5.3.2 Prefer xrange to range unless you need the resulting list

In Python 2, both xrange and range let you iterate over a sequence of numbers in a specified range. xrange, however, doesn’t store the entire list in memory; it produces each number as it is needed. Much of the time this isn’t an issue. If you’re working with very large number ranges (or breaking out of a loop over a very large range), the memory and performance gains can be substantial.

5.3.2.1 Harmful

# A loop over a large range that breaks out
# early: a double whammy!
even_number = int()
for index in range(1000000):
    if index % 2 == 0:
        even_number = index
        break

5.3.2.2 Idiomatic

even_number = int()
for index in xrange(1000000):
    if index % 2 == 0:
        even_number = index
        break
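For readers on Python 3, where xrange no longer exists and range itself is lazy, the same idea can be checked with a small sketch. It assumes a CPython 3 interpreter, since sys.getsizeof reports implementation-specific sizes:

```python
import sys

# In Python 3, range is a lazy sequence: its size in memory does not
# grow with the number of values it represents.
small_range = range(10)
large_range = range(1000000)
same_size = sys.getsizeof(small_range) == sys.getsizeof(large_range)

# Materializing the values as a list is what costs memory.
large_list = list(large_range)
list_is_bigger = sys.getsizeof(large_list) > sys.getsizeof(large_range)
```

In other words, on Python 3 the plain range call in the “Harmful” example is already the memory-friendly choice.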

5.4 Dictionaries

5.4.1 Use a dict as a substitute for a switch...case statement

Unlike many other languages, Python doesn’t have a switch...case construct. Typically, switch inspects the value of an expression and jumps to the case statement with the given value. It’s a shortcut for calling a single piece of code out of a number of possibilities based on a runtime value. For example, if we’re writing a command-line based calculator, a switch statement may be used on the operator typed by the user. “+” would call the addition() function, “*” the multiplication() function, and so on.

The naive alternative in Python is to write a series of if...else statements. This gets old quickly. Thankfully, functions are first-class objects in Python, so we can treat them the same as any other variable. This is a very powerful concept, and many other powerful concepts use first-class functions as a building block.

So how does this help us with switch...case statements? Rather than trying to emulate the exact functionality, we can take advantage of the fact that functions are first-class objects and can be stored as values in a dict. Returning to the calculator example, storing the string operator (e.g. “+”) as the key and its associated function as the value, we arrive at a clear, readable way to achieve the same functionality as switch...case.

This idiom is useful for more than just picking a function to dispatch using a string key. It can be generalized to anything that can be used as a dict key, which in Python is any hashable object (strings, numbers, tuples of hashable values, and more). Using this method, one could create a Factory class that chooses which type to instantiate via a parameter. Or it could be used to store states and their transitions when building a state machine. Once you fully appreciate the power of “everything is an object”, you’ll find elegant solutions to once-difficult problems.

5.4.1.1 Harmful

def apply_operation(left_operand, right_operand, operator):
    if operator == '+':
        return left_operand + right_operand
    elif operator == '-':
        return left_operand - right_operand
    elif operator == '*':
        return left_operand * right_operand
    elif operator == '/':
        return left_operand / right_operand

5.4.1.2 Idiomatic

def apply_operation(left_operand, right_operand, operator):
    import operator as op
    operator_mapper = {'+': op.add, '-': op.sub,
                       '*': op.mul, '/': op.truediv}
    return operator_mapper[operator](left_operand, right_operand)
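The factory idea mentioned earlier can be sketched the same way, since classes are objects and can be stored as dict values. The Circle and Square classes, the SHAPE_TYPES mapping, and create_shape are all hypothetical names invented for this illustration:

```python
class Circle(object):
    def __init__(self, size):
        self.size = size

class Square(object):
    def __init__(self, size):
        self.size = size

# Classes are objects too, so they can be stored as dict values.
SHAPE_TYPES = {'circle': Circle, 'square': Square}

def create_shape(shape_name, size):
    """Look up the class by name and instantiate it."""
    try:
        shape_class = SHAPE_TYPES[shape_name]
    except KeyError:
        raise ValueError('Unknown shape: {}'.format(shape_name))
    return shape_class(size)

shape = create_shape('circle', 2)
```

Translating the KeyError into a ValueError is one possible design choice; it plays the role of the default branch of a switch statement.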

5.4.2 Use the default parameter of dict.get to provide default values

Often overlooked in the definition of dict.get is the default parameter. Without using default (or the collections.defaultdict class), your code will be littered with confusing if statements. Remember, strive for clarity.

5.4.2.1 Harmful

log_severity = None
if 'severity' in configuration:
    log_severity = configuration['severity']
else:
    log_severity = 'Info'

5.4.2.2 Idiomatic

log_severity = configuration.get('severity', 'Info')
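The collections.defaultdict class mentioned above covers the related case where a missing key should be created with a default value rather than merely read. A minimal sketch, with a made-up word list as input:

```python
from collections import defaultdict

words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

# Missing keys are created on first access with the factory's
# return value (an empty list here), so no 'if key in dict' check.
words_by_letter = defaultdict(list)
for word in words:
    words_by_letter[word[0]].append(word)

# dict.get with a default reads a value without inserting the key.
count_of_d_words = len(words_by_letter.get('d', []))
```

The difference is worth remembering: indexing a defaultdict inserts the missing key, while get leaves the dict unchanged.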

5.4.3 Use a dict comprehension to build a dict clearly and efficiently

The list comprehension is a well-known Python construct. Less well known is the dict comprehension. Its purpose is identical: to construct a dict in place using the widely understood comprehension syntax.

5.4.3.1 Harmful

user_email = {}
for user in users_list:
    if user.email:
        user_email[user.name] = user.email

5.4.3.2 Idiomatic

user_email = {user.name: user.email
              for user in users_list if user.email}

5.5 Sets

5.5.1 Understand and use the mathematical set operations

Sets are an easy-to-understand data structure. Like a dict with keys but no values, the set class implements the Iterable and Container interfaces. Thus, a set can be used in a for loop or as the subject of an in expression.

For programmers who haven’t seen a Set data type before, it may appear to be of limited use. Key to understanding their usefulness is understanding their origin in mathematics. Set Theory is the branch of mathematics devoted to the study of sets. Understanding the basic mathematical set operations is the key to harnessing their power.

Don’t worry; you don’t need a degree in math to understand or use sets. You just need to remember a few simple operations:

Union

The set of elements in A, B, or both A and B (written A | B in Python).

Intersection

The set of elements in both A and B (written A & B in Python).

Difference

The set of elements in A but not B (written A - B in Python).

Note: order matters here. A - B is not necessarily the same as B - A.

Symmetric Difference

The set of elements in either A or B, but not both A and B (written A ^ B in Python).
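The four operations above can be sketched on two small sets of integers, chosen arbitrarily for illustration:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b                  # {1, 2, 3, 4, 5}
intersection = a & b           # {3, 4}
difference = a - b             # {1, 2}
reverse_difference = b - a     # {5}
symmetric_difference = a ^ b   # {1, 2, 5}
```

Comparing difference with reverse_difference shows why the order of the operands matters for subtraction.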

When working with lists of data, a common task is finding the elements that appear in all of the lists. Any time you need to choose elements from two or more sequences based on properties of sequence membership, look to use a set.

Below, we’ll explore some typical examples.

5.5.1.1 Harmful

def get_both_popular_and_active_users():
    # Assume the following two functions each return a
    # list of user names
    most_popular_users = get_list_of_most_popular_users()
    most_active_users = get_list_of_most_active_users()
    popular_and_active_users = []
    for user in most_active_users:
        if user in most_popular_users:
            popular_and_active_users.append(user)
    return popular_and_active_users

5.5.1.2 Idiomatic

def get_both_popular_and_active_users():
    # Assume the following two functions each return a
    # list of user names
    return(set(
        get_list_of_most_active_users()) & set(
            get_list_of_most_popular_users()))

5.5.2 Use a set comprehension to generate sets concisely

The set comprehension syntax is relatively new in Python and, therefore, often overlooked. Just as a list can be generated using a list comprehension, a set can be generated using a set comprehension. In fact, the syntax is nearly identical (modulo the enclosing characters).

5.5.2.1 Harmful

users_first_names = set()
for user in users:
    users_first_names.add(user.first_name)

5.5.2.2 Idiomatic

users_first_names = {user.first_name for user in users}

5.5.3 Use sets to eliminate duplicate entries from Iterable containers

It’s quite common to have a list or dict with duplicate values. In a list of all surnames of employees at a large company, we’re bound to encounter common surnames more than once in the list. If we need a list of all the unique surnames, we can use a set to do the work for us. Three aspects of sets make them the perfect answer to our problem:

1. A set contains only unique elements

2. Adding an already existing element to a set is essentially “ignored”

3. A set can be built from any Iterable whose elements are hashable

Continuing the example, we may have an existing display function that accepts a sequence and displays its elements in one of many formats. After creating a set from our original list, will we need to change our display function?

Nope. Assuming our display function is implemented reasonably, our set can be used as a drop-in replacement for a list. This works thanks to the fact that a set, like a list, is an Iterable and can thus be used in a for loop, list comprehension, etc.

5.5.3.1 Harmful

unique_surnames = []
for surname in employee_surnames:
    if surname not in unique_surnames:
        unique_surnames.append(surname)

def display(elements, output_format='html'):
    if output_format == 'std_out':
        for element in elements:
            print(element)
    elif output_format == 'html':
        as_html = '<ul>'
        for element in elements:
            as_html += '<li>{}</li>'.format(element)
        return as_html + '</ul>'
    else:
        raise RuntimeError('Unknown format {}'.format(output_format))

5.5.3.2 Idiomatic

unique_surnames = set(employee_surnames)

def display(elements, output_format='html'):
    if output_format == 'std_out':
        for element in elements:
            print(element)
    elif output_format == 'html':
        as_html = '<ul>'
        for element in elements:
            as_html += '<li>{}</li>'.format(element)
        return as_html + '</ul>'
    else:
        raise RuntimeError('Unknown format {}'.format(output_format))

5.6 Tuples

5.6.1 Use _ as a placeholder for data in a tuple that should be ignored

When setting a tuple equal to some ordered data, oftentimes not all of the data is actually needed. Instead of creating throwaway variables with confusing names, use the _ as a placeholder to tell the reader, “This data will be discarded.”

5.6.1.1 Harmful

(name, age, temp, temp2) = get_user_info(user)
if age > 21:
    output = '{name} can drink!'.format(name=name)
# "Wait, where are temp and temp2 being used?"

5.6.1.2 Idiomatic

(name, age, _, _) = get_user_info(user)
if age > 21:
    output = '{name} can drink!'.format(name=name)
# "Clearly, only name and age are interesting"

5.6.2 Use tuples to unpack data

In Python, it is possible to “unpack” data for multiple assignment. Those familiar with LISP may know this as destructuring bind.

5.6.2.1 Harmful

list_from_comma_separated_value_file = ['dog', 'Fido', 10]

animal = list_from_comma_separated_value_file[0]

name = list_from_comma_separated_value_file[1]

age = list_from_comma_separated_value_file[2]

output = ('{name} the {animal} is {age} years old'.format(

animal=animal, name=name, age=age))

5.6.2.2 Idiomatic

list_from_comma_separated_value_file = ['dog', 'Fido', 10]

(animal, name, age) = list_from_comma_separated_value_file

output = ('{name} the {animal} is {age} years old'.format(

animal=animal, name=name, age=age))

5.7 Classes

5.7.1 Use leading underscores in function and variable names to denote “private” data

All attributes of a class, be they data or functions, are inherently “public” in Python. A client is free to add attributes to a class after it’s been defined. In addition, if the class is meant to be inherited from, a subclass may unwittingly change an attribute of the base class. Lastly, it’s generally useful to be able to signal to users of your class that certain portions are logically public (and won’t be changed in a backwards incompatible way) while other attributes are purely internal implementation artifacts and shouldn’t be used directly by client code using the class.

A number of widely followed conventions have arisen to make the author’s intention more explicit and help avoid unintentional naming conflicts. While the following two idioms are commonly referred to as ‘nothing more than conventions,’ both of them, in fact, alter the behavior of the interpreter when used.

First, attributes to be ‘protected’, which are not meant to be used directly by clients, should be prefixed with a single underscore. Second, ‘private’ attributes not meant to be accessible by a subclass should be prefixed by two underscores. Of course, these are (mostly) merely conventions. Nothing would stop a client from being able to access your ‘private’ attributes, but the convention is so widely used you likely won’t run into developers that purposely choose not to honor it. It’s just another example of the Python community settling on a single way of accomplishing something.

Earlier, I hinted that the single and double underscore prefixes are more than mere conventions. Few developers are aware that prepending underscores to names actually does something. A module-level name with a single leading underscore won’t be imported by from module import * (unless the module explicitly lists it in __all__). Prepending two underscores to an attribute name in a class invokes Python’s name mangling. This makes it far less likely that someone who subclasses your class will inadvertently replace your class’s attribute with something unintended. If Foo is a class, a method defined as def __bar() will be ‘mangled’ to _Foo__bar, following the pattern _classname__attributename.
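Name mangling is easy to observe directly. In this minimal sketch, the attribute names __bar and _baz are placeholders picked for illustration:

```python
class Foo(object):
    def __init__(self):
        self.__bar = 'mangled'   # stored as _Foo__bar
        self._baz = 'protected'  # stored as-is

foo = Foo()

# vars() returns the instance's attribute dict, keyed by the
# names the interpreter actually stored.
attribute_names = sorted(vars(foo))
```

Inspecting attribute_names shows '_Foo__bar' and '_baz': the double-underscore name was rewritten with the class name, while the single-underscore name was left untouched. The value is still reachable as foo._Foo__bar, which is why this is protection against accidents rather than real access control.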

5.7.1.1 Harmful

class Foo(object):
    def __init__(self):
        self.id = 8
        self.value = self.get_value()

    def get_value(self):
        pass

    def should_destroy_earth(self):
        return self.id == 42

class Baz(Foo):
    def get_value(self, some_new_parameter):
        """Since 'get_value' is called from the base class's
        __init__ method and the base class definition doesn't
        take a parameter, trying to create a Baz instance will
        fail
        """
        pass

class Qux(Foo):
    """We aren't aware of Foo's internals, and we innocently
    create an instance attribute named 'id' and set it to 42.
    This overwrites Foo's id attribute and we inadvertently
    blow up the earth.
    """
    def __init__(self):
        super(Qux, self).__init__()
        self.id = 42
        # No relation to Foo's id, purely coincidental

q = Qux()
b = Baz() # Raises 'TypeError'
q.should_destroy_earth() # returns True
q.id == 42 # returns True

5.7.1.2 Idiomatic

class Foo(object):
    def __init__(self):
        """Since 'id' is of vital importance to us, we don't
        want a derived class accidentally overwriting it. We'll
        prepend with double underscores to introduce name
        mangling.
        """
        self.__id = 8
        self.value = self.__get_value() # Call our 'private copy'

    def get_value(self):
        pass

    def should_destroy_earth(self):
        return self.__id == 42

    # Here, we're storing a 'private copy' of get_value,
    # and assigning it to '__get_value'. Even if a derived
    # class overrides get_value in a way incompatible with
    # ours, we're fine
    __get_value = get_value

class Baz(Foo):
    def get_value(self, some_new_parameter):
        pass

class Qux(Foo):
    def __init__(self):
        """Now when we set 'id' to 42, it's not the same 'id'
        that 'should_destroy_earth' is concerned with. In fact,
        if you inspect a Qux object, you'll find it doesn't
        have an __id attribute. So we can't mistakenly change
        Foo's __id attribute even if we wanted to.
        """
        self.id = 42
        # No relation to Foo's id, purely coincidental
        super(Qux, self).__init__()

q = Qux()
b = Baz() # Works fine now
q.should_destroy_earth() # returns False
q.id == 42 # returns True

5.7.2 Define __str__ in a class to show a human-readable representation

When defining a class that is likely to be used with print(), the default Python representation isn’t too helpful. By defining a __str__ method, you can control how calling print on an instance of your class will look.

5.7.2.1 Harmful

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
print(p)
# Prints '<__main__.Point object at 0x91ebd0>'

5.7.2.2 Idiomatic

class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        return '{0}, {1}'.format(self.x, self.y)

p = Point(1, 2)
print(p)
# Prints '1, 2'

5.8 Context Managers

5.8.1 Use a context manager to ensure resources are properly managed

Similar to the RAII principle in languages like C++ and D, context managers (objects meant to be used with the with statement) can make resource management both safer and more explicit. The canonical example is file IO.

Take a look at the “Harmful” code below. What happens if raise_exception does, in fact, raise an exception? Since we haven’t caught it in the code below, it will propagate up the stack. We’ve hit an exit point in our code that might have been overlooked, and we now have no way to close the opened file.

There are a number of classes in the standard library that support or use a context manager. In addition, user-defined classes can easily be made to work with the with statement by defining __enter__ and __exit__ methods. Functions may be turned into context managers through the contextlib module.

5.8.1.1 Harmful

file_handle = open(path_to_file, 'r')
for line in file_handle.readlines():
    if raise_exception(line):
        print('No! An Exception!')

5.8.1.2 Idiomatic

with open(path_to_file, 'r') as file_handle:
    for line in file_handle:
        if raise_exception(line):
            print('No! An Exception!')
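The two ways of writing your own context manager mentioned above, a class with __enter__ and __exit__ and a function wrapped with contextlib, might look like the following sketch. ManagedResource, logged, and the events list are hypothetical names used only for this illustration:

```python
from contextlib import contextmanager

class ManagedResource(object):
    """Hypothetical resource that records whether it was released."""
    def __init__(self):
        self.is_open = False

    def __enter__(self):
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even if the with block raised; returning False
        # lets any exception keep propagating.
        self.is_open = False
        return False

events = []

@contextmanager
def logged(label):
    """Generator-based context manager built with contextlib."""
    events.append('enter ' + label)
    try:
        yield
    finally:
        events.append('exit ' + label)

resource = ManagedResource()
try:
    with resource, logged('work'):
        raise ValueError('something went wrong')
except ValueError:
    pass
```

Even though the body of the with block raises, both cleanup paths run: resource.is_open is False afterwards and events records both the enter and the exit.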

5.9 Generators

5.9.1 Prefer a generator expression to a list comprehension for simple iteration

When dealing with a sequence, it is common to need to iterate over a slightly modified version of the sequence a single time. For example, you may want to print out the first names of all of your users in all capital letters.

Your first instinct should be to build and iterate over the sequence in place. A list comprehension seems ideal, but there’s an even better Python built-in: a generator expression.

The main difference? A list comprehension generates a list object and fills in all of the elements immediately. For large lists, this can be prohibitively expensive. The generator returned by a generator expression, on the other hand, generates each element “on-demand”. That list of uppercase user names you want to print out? Probably not a problem. But what if you wanted to write out the title of every book known to the Library of Congress? You’d likely run out of memory building the list with a list comprehension, while a generator expression won’t bat an eyelash. A logical extension of the way generator expressions work is that you can use them on infinite sequences.

5.9.1.1 Harmful

for uppercase_name in [name.upper() for name in get_all_usernames()]:
    process_normalized_username(uppercase_name)

5.9.1.2 Idiomatic

for uppercase_name in (name.upper() for name in get_all_usernames()):
    process_normalized_username(uppercase_name)
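The point about infinite sequences can be illustrated with itertools.count, which yields integers forever. This is a small sketch; squaring the numbers and taking the first five values are arbitrary choices:

```python
from itertools import count, islice

# count() yields 0, 1, 2, ... forever; a list comprehension over it
# would never finish, but a generator expression is evaluated lazily.
squares = (number * number for number in count())

# Take only the first five values from the infinite stream.
first_five_squares = list(islice(squares, 5))
```

Only the five requested squares are ever computed; the rest of the infinite sequence is never touched.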

5.9.2 Use a generator to lazily load infinite sequences

Often, it’s useful to provide a way to iterate over a sequence that’s essentially infinite. Other times, you need to provide an interface to a sequence that’s incredibly expensive to calculate, and you don’t want your user sitting on their hands waiting for you to finish building a list.

In both cases, generators are your friend. A generator is a special type of coroutine which returns an iterable. The state of the generator is saved, so that the next call into the generator continues where it left off. In the examples below, we’ll see how to use a generator to help in each of the cases mentioned above.

5.9.2.1 Harmful

def get_twitter_stream_for_keyword(keyword):
    """Gets the 'live stream', but only at the moment
    the function is initially called. To get more entries,
    the client code needs to keep calling
    'get_twitter_stream_for_keyword'. Not ideal.
    """
    imaginary_twitter_api = ImaginaryTwitterAPI()
    if imaginary_twitter_api.can_get_stream_data(keyword):
        return imaginary_twitter_api.get_stream(keyword)

current_stream = get_twitter_stream_for_keyword('#jeffknupp')
for tweet in current_stream:
    process_tweet(tweet)

# Uh, I want to keep showing tweets until the program is quit.
# What do I do now? Just keep calling
# get_twitter_stream_for_keyword? That seems stupid.

def get_list_of_incredibly_complex_calculation_results(data):
    return [first_incredibly_long_calculation(data),
            second_incredibly_long_calculation(data),
            third_incredibly_long_calculation(data),
            ]

5.9.2.2 Idiomatic

def get_twitter_stream_for_keyword(keyword):
    """Now, 'get_twitter_stream_for_keyword' is a generator
    and will continue to generate Iterable pieces of data
    one at a time until 'can_get_stream_data(keyword)' is
    False (which may be never).
    """
    imaginary_twitter_api = ImaginaryTwitterAPI()
    while imaginary_twitter_api.can_get_stream_data(keyword):
        yield imaginary_twitter_api.get_stream(keyword)

# Because it's a generator, I can sit in this loop until
# the client wants to break out
for tweet in get_twitter_stream_for_keyword('#jeffknupp'):
    if got_stop_signal:
        break
    process_tweet(tweet)

def get_list_of_incredibly_complex_calculation_results(data):
    """A simple example to be sure, but now when the client
    code iterates over the call to
    'get_list_of_incredibly_complex_calculation_results',
    we only do as much work as necessary to generate the
    current item.
    """
    yield first_incredibly_long_calculation(data)
    yield second_incredibly_long_calculation(data)
    yield third_incredibly_long_calculation(data)
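Because the examples above depend on an imaginary API, here is a self-contained sketch of how a generator saves its state between calls. The countdown function is a hypothetical example, not part of the code above:

```python
def countdown(start):
    """Yield start, start - 1, ..., 1, pausing after each value."""
    current = start
    while current > 0:
        yield current
        current -= 1

counter = countdown(3)
first = next(counter)    # runs until the first yield
second = next(counter)   # resumes after the yield, with state intact
remaining = list(counter)
```

Calling countdown(3) runs none of the function body; each next call resumes execution just after the previous yield, with the local variable current preserved. Here first is 3, second is 2, and remaining holds whatever is left, [1].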