
Chapter 5. Iterators and Generators

QUESTIONS YOU’LL BE ABLE TO ANSWER AFTER THIS CHAPTER

§ How do generators save memory?

§ When is the best time to use a generator?

§ How can I use itertools to create complex generator workflows?

§ When is lazy evaluation beneficial, and when is it not?

When many people with experience in another language start learning Python, they are taken aback by the difference in for loop notation. That is to say, instead of writing:

# Other languages
for (i=0; i<N; i++) {
    do_work(i);
}

they are instead introduced to a new function called range or xrange:

# Python
for i in range(N):
    do_work(i)

These two functions provide insight into the paradigm of programming using generators. In order to fully understand generators, let us first make simple implementations of the range and xrange functions:

def range(start, stop, step=1):
    numbers = []
    while start < stop:
        numbers.append(start)
        start += step
    return numbers

def xrange(start, stop, step=1):
    while start < stop:
        yield start  # 1
        start += step

for i in range(1, 10000):
    pass

for i in xrange(1, 10000):
    pass

1

This function will yield many values instead of returning one value. This turns this regular-looking function into a generator that can be repeatedly polled for the next available value.

The first thing to note is that the range implementation must precreate the list of all numbers within the range. So, if the range is from 1 to 10,000, the function will do 10,000 appends to the numbers list (which, as we discussed in Chapter 3, has overhead associated with it), and then return it. The generator, on the other hand, is able to “return” many values. Every time the code gets to the yield, the function emits its value, and when another value is requested the function resumes running (maintaining its previous state) and emits the new value. When the function reaches its end, a StopIteration exception is raised, indicating that the generator has no more values. As a result, even though both functions must, in the end, do the same number of calculations, the range version of the preceding loop uses 10,000x more memory (or Nx more memory if we are ranging from 1 to N).
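One quick way to see this difference is with sys.getsizeof, using the built-in range and xrange. Note that getsizeof reports only the size of the container object itself, not the integer objects it references, and the exact numbers vary by platform:

import sys

print sys.getsizeof(range(10000))    # the full list object: tens of kilobytes
print sys.getsizeof(xrange(10000))   # a small object of constant size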

With this code in mind, we can decompose the for loops that use our implementations of range and xrange. In Python, for loops require that the object we are looping over supports iteration. This means that we must be able to create an iterator out of the object we want to loop over. To create an iterator from almost any object, we can simply use Python’s built-in iter function. This function, for lists, tuples, dictionaries, and sets, returns an iterator over the items or keys in the object. For more complex objects, iter returns the result of the __iter__ property of the object. Since xrange already returns an iterator, calling iter on it is a trivial operation, and it simply returns the original object (so type(xrange(1,10)) == type(iter(xrange(1,10)))). However, since range returns a list, we must create a new object, a list iterator, that will iterate over all values in the list. Once an iterator is created, we simply call the next() function on it, retrieving new values until a StopIteration exception is thrown. This gives us a good deconstructed view of for loops, as illustrated in Example 5-1.

Example 5-1. Python for loop deconstructed

# The Python loop
for i in object:
    do_work(i)

# Is equivalent to
object_iterator = iter(object)
while True:
    try:
        i = object_iterator.next()
        do_work(i)
    except StopIteration:
        break

The for loop code shows that we are doing extra work calling iter when using range instead of xrange. When using xrange, we create a generator that is trivially transformed into an iterator (since it is already an iterator!); however, for range we need to allocate a new list and precompute its values, and then we still must create an iterator!

More importantly, precomputing the range list requires allocating enough space for the full dataset and setting each element to the correct value, even though we only ever require one value at a time. This also makes the list allocation useless. In fact, it may even make the loop unrunnable, because range may be trying to allocate more memory than is available (range(100,000,000) would create a list 3.1 GB large!). By timing the results, we can see this very explicitly:

def test_range():
    """
    >>> %timeit test_range()
    1 loops, best of 3: 446 ms per loop
    """
    for i in range(1, 10000000):
        pass

def test_xrange():
    """
    >>> %timeit test_xrange()
    1 loops, best of 3: 276 ms per loop
    """
    for i in xrange(1, 10000000):
        pass

This may seem like a trivial enough problem to solve—simply replace all range calls with xrange—but the problem is actually a lot deeper. Let’s say we had a long list of numbers, and we wanted to know how many of them are divisible by 3. This could look like:

divisible_by_three = len([n for n in list_of_numbers if n % 3 == 0])

However, this suffers from the same problem as range does. Since we are doing list comprehension, we are pregenerating the list of numbers divisible by 3, only to do a calculation on it and forget that list. If this list of numbers was quite large, this could mean allocating a lot of memory—potentially more than is available—for almost no reason.

Recall that we can create a list comprehension using a statement of the form [<value> for <item> in <sequence> if <condition>]. This will create a list of all the <value> items. Alternatively, we can use similar syntax to create a generator of the <value> items instead of a list by doing (<value> for <item> in <sequence> if <condition>).

Using this subtle difference between list comprehension and generator comprehension, we can optimize the preceding code for divisible_by_three. However, generators do not have a length property. As a result, we will have to be a bit clever:

divisible_by_three = sum((1 for n in list_of_numbers if n % 3 == 0))

Here, we have a generator that emits a value of 1 whenever it encounters a number divisible by 3, and nothing otherwise. By summing all elements in this generator we are essentially doing the same as the list comprehension version. The performance of the two versions of this code is almost equivalent, but the memory impact of the generator version is far less than that of the list comprehension. Furthermore, we are able to simply transform the list version into a generator because all that matters for each element of the list is its current value—either the number is divisible by 3 or it is not; it doesn’t matter where its placement is in the list of numbers or what the previous/next values are. More complex functions can also be transformed into generators, but depending on their reliance on state, this can become a difficult thing to do.
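If you want to measure the memory difference yourself, one rough way (assuming IPython with the memory_profiler extension installed) is to compare the two forms with %memit; the list comprehension's peak memory grows with the number of matching elements, while the generator version's stays essentially flat:

In [1]: %load_ext memory_profiler
In [2]: list_of_numbers = range(10000000)
In [3]: %memit len([n for n in list_of_numbers if n % 3 == 0])
In [4]: %memit sum(1 for n in list_of_numbers if n % 3 == 0)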

Iterators for Infinite Series

Since we only need to store some version of a state and emit only the current value, generators are ideal for infinite series. The Fibonacci series is a great example—it is an infinite series with two state variables (the two last Fibonacci numbers):

def fibonacci():
    i, j = 0, 1
    while True:
        yield j
        i, j = j, i + j

We see here that although j is the value being emitted, we keep track of i as well since this holds the state of the Fibonacci series. The amount of state necessary for a calculation is quite important for generators because it translates into the actual memory footprint of the object. This makes it so that if we have a function that uses a lot of state and outputs a very small amount of data, it may be better to have this function precompute the list of data rather than have a generator for it.
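For example, because fibonacci() never terminates on its own, we can use itertools.islice to peel off just the values we want; a quick usage sketch:

from itertools import islice

# The first ten Fibonacci numbers, without ever materializing the infinite series
print list(islice(fibonacci(), 10))   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]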

One reason why generators aren’t used as much as they could be is that a lot of the logic within them can be encapsulated in your logic code. This means that generators are really a way of organizing your code and having smarter loops. For example, we could answer the question, “How many Fibonacci numbers below 5,000 are odd?” in multiple ways:

def fibonacci_naive():
    i, j = 0, 1
    count = 0
    while j <= 5000:
        if j % 2:
            count += 1
        i, j = j, i + j
    return count

def fibonacci_transform():
    count = 0
    for f in fibonacci():
        if f > 5000:
            break
        if f % 2:
            count += 1
    return count

from itertools import takewhile

def fibonacci_succinct():
    is_odd = lambda x: x % 2
    under_5000 = takewhile(lambda x: x <= 5000, fibonacci())
    return sum(1 for x in under_5000 if is_odd(x))

All of these methods have similar runtime properties (namely, they all have the same memory footprint and the same performance), but the fibonacci_transform function benefits from several things. Firstly, it is much more verbose than fibonacci_succinct, which means it will be easy for another developer to debug and understand. The latter mainly stands as a warning for the next section, where we cover some common workflows using itertools; while the module greatly simplifies many simple actions with iterators, it can also quickly make Python code very un-Pythonic. Conversely, fibonacci_naive is doing multiple things at a time, which hides the actual calculation it is doing! While it is obvious in the generator function that we are iterating over the Fibonacci numbers, we are not overencumbered by the actual calculation. Lastly, fibonacci_transform is more generalizable. This function could be renamed num_odd_under_5000 and take in the generator by argument, and thus work over any series.

One last benefit of the fibonacci_transform function is that it supports the notion that in computation there are two phases: generating data and transforming data. This function is very clearly performing a transformation on data, while the fibonacci function generates it. This clear demarcation adds extra clarity and extra functionality: we can move a transformative function to work on a new set of data, or perform multiple transformations on existing data. This paradigm has always been important when creating complex programs; however, generators facilitate this clearly by making generators responsible for creating the data, and normal functions responsible for acting on the generated data.
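To make the point about generalization concrete, here is one possible sketch of that refactoring (the name and signature are illustrative, not taken from the original example):

def count_odd_under(series, limit=5000):
    # 'series' can be any iterator of numbers; fibonacci() is just one producer
    count = 0
    for value in series:
        if value > limit:
            break
        if value % 2:
            count += 1
    return count

print count_odd_under(fibonacci(), 5000)   # same answer as fibonacci_transform()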

Lazy Generator Evaluation

As touched on previously, the way we get the memory benefits with a generator is by dealing only with the current values of interest. At any point in our calculation with a generator, we only have the current value and cannot reference any other items in the sequence (algorithms that perform this way are generally called “single pass” or “online”). This can sometimes make generators more difficult to use, but there are many modules and functions that can help.

The main library of interest is itertools, in the standard library. It supplies generator versions of Python’s built-in functions map, filter, and zip (called imap, ifilter, and izip in Python 2’s itertools), in addition to many other useful functions. Of particular note are the following (a short sketch of their behavior appears after this list):

islice

Allows slicing a potentially infinite generator

chain

Chains together multiple generators

takewhile

Adds a condition that will end a generator

cycle

Makes a finite generator infinite by constantly repeating it
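As a rough illustration of how these behave (the inputs are arbitrary toy examples, and count also comes from itertools):

from itertools import islice, chain, takewhile, cycle, count

print list(islice(count(), 5))                    # [0, 1, 2, 3, 4]
print list(chain([1, 2], (x for x in [3, 4])))    # [1, 2, 3, 4]
print list(takewhile(lambda x: x < 4, count()))   # [0, 1, 2, 3]
print list(islice(cycle([1, 2, 3]), 7))           # [1, 2, 3, 1, 2, 3, 1]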

Let’s build up an example of using generators to analyze a large dataset. Let’s say we’ve had an analysis routine going over temporal data, one piece of data per second, for the last 20 years—that’s 631,152,000 data points! The data is stored in a file, one second per line, and we cannot load the entire dataset into memory. If we wanted to do some simple anomaly detection, we could use generators and never allocate any lists!

The problem will be: given a data file of the form “timestamp, value”, find days with a value 3 sigma away from that day’s mean. We start by writing the code that will read the file, line by line, and output each line’s value as a Python object. We will also create a read_fake_data generator to generate fake data we can test our algorithms with. For this function we still take the argument filename, so as to have the same function signature as read_data; however, we will simply disregard it. These two functions, shown in Example 5-2, are indeed lazily evaluated: we only read the next line in the file, or generate new fake data, when the next() method of the generator is called.

Example 5-2. Lazily reading data

from random import normalvariate, random
from itertools import count

def read_data(filename):
    with open(filename) as fd:
        for line in fd:
            data = line.strip().split(',')
            yield map(int, data)

def read_fake_data(filename):
    for i in count():
        sigma = random() * 10
        yield (i, normalvariate(0, sigma))

Now, we can use the groupby function in itertools to group together timestamps that occur on the same day (Example 5-3). This function works by taking in a sequence of items and a key used to group these items. The output is a generator that produces tuples whose items are the key for the group and a generator for the items in the group. As our key function, we will create a lambda function that returns a date object. These date objects are equal when they occur on the same day, which will group them by day. This “key” function could be anything—we could group our data by hour, by year, or by some property in the actual value. The only limitation is that groups will only be formed for data that is sequential. So, if we had the input A A A A B B A A and had groupby group by the letter, we would get three groups, (A, [A, A, A, A]), (B, [B, B]), and (A, [A, A]).
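To see this sequential grouping directly, here is a minimal sketch that uses a string of letters as the input and the letter itself as the key:

from itertools import groupby

grouped = [(letter, list(items)) for letter, items in groupby("AAAABBAA")]
print grouped   # [('A', ['A', 'A', 'A', 'A']), ('B', ['B', 'B']), ('A', ['A', 'A'])]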

Example 5-3. Grouping our data

from datetime import date
from itertools import groupby

def day_grouper(iterable):
    key = lambda (timestamp, value): date.fromtimestamp(timestamp)
    return groupby(iterable, key)

Now to do the actual anomaly detection. We will do this by iterating through a day’s values and keeping track of the mean and maximum values. The mean will be calculated using an online mean and standard deviation algorithm.[8] The maximum is kept because it is the best candidate for being anomalous—if the max is more than 3 sigma larger than the mean, then we will return the date object representing the day we just analyzed. Otherwise, we will return False for posterity; however, we could equally have just ended the function (and implicitly returned None). We output these values because this check_anomaly function is defined as a data filter—a function that returns True for data that should be kept and False for data that should be discarded. This will allow us to filter through the original dataset and only keep days that match our condition. The check_anomaly function is shown in Example 5-4.

Example 5-4. Generator-based anomaly detection

import math

def check_anomaly((day, day_data)):
    # We find the mean, standard deviation, and maximum values for the day.
    # Using a single-pass mean/standard deviation algorithm allows us to only
    # read through the day's data once.
    n = 0
    mean = 0.0  # start as a float so the running mean is not truncated by integer division
    M2 = 0
    max_value = None
    for timestamp, value in day_data:
        n += 1
        delta = value - mean
        mean = mean + delta / n
        M2 += delta * (value - mean)
        max_value = max(max_value, value)
    variance = M2 / (n - 1)
    standard_deviation = math.sqrt(variance)

    # Here is the actual check of whether that day's data is anomalous. If it
    # is, we return the value of the day; otherwise, we return False.
    if max_value > mean + 3 * standard_deviation:
        return day
    return False

One aspect of this function that may seem strange is the extra set of parentheses in the parameter definition. This is not a typo, but a result of this function taking input from the groupby generator. Recall that groupby yields tuples that become the parameters to this check_anomaly function. As a result, we must do tuple expansion in order to properly extract the group key and the group data. Since we are using ifilter, another way of dealing with this, instead of having the tuple expansion inside the function definition, would be to define an istarfilter helper, which would do for ifilter what itertools.starmap does for imap (see the itertools documentation for more information).
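itertools does not actually ship an istarfilter; a minimal sketch of such a helper (a generator expression is enough) might look like this:

def istarfilter(function, iterable):
    # Like ifilter, but expands each item into positional arguments:
    # keep the items for which function(*item) is truthy
    return (item for item in iterable if function(*item))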

Finally, we can put together the chain of generators to get the days that had anomalous data (Example 5-5).

Example 5-5. Chaining together our generators

from itertools import ifilter, imap

data = read_data(data_filename)
data_day = day_grouper(data)
anomalous_dates = ifilter(None, imap(check_anomaly, data_day))  # 1

first_anomalous_date = anomalous_dates.next()
print "The first anomalous date is: ", first_anomalous_date

1

ifilter will remove any elements that do not satisfy the given filter. By default (sending None to the first parameter triggers this), ifilter will filter out any elements that evaluate to False. This makes it so we don’t include any days that check_anomaly didn’t think were anomalous.
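For instance, this default behavior is easy to verify on a small example:

from itertools import ifilter

# With None as the predicate, only truthy elements survive
print list(ifilter(None, [False, 0, None, 1, 'a']))   # [1, 'a']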

This method very simply allows us to get the list of days that are anomalous without having to load the entire dataset. One thing to note is that this code does not actually do any calculation; it simply sets up the pipeline to do the calculation. The file will never be read until we call anomalous_dates.next() or somehow iterate on the anomalous_dates generator. In fact, we only ever do the analysis when a new value is requested from anomalous_dates. So, if our full dataset has five anomalous dates, but in our code we retrieve one and then stop requesting new values, the file will only be read up to the point where this day’s data appears. This is called lazy evaluation: only the calculations that are explicitly requested are performed, which can drastically reduce overall runtime if there is an early termination condition.

Another nicety about organizing analysis this way is that it allows us to do more expansive calculations easily, without having to rework large parts of the code. For example, if we want to have a moving window of one day instead of chunking up by days, we can simply write a new day_grouper that does this and use it instead:

from datetime import datetime
from itertools import islice

def rolling_window_grouper(data, window_size=3600):
    window = tuple(islice(data, 0, window_size))
    while True:
        current_datetime = datetime.fromtimestamp(window[0][0])
        yield (current_datetime, window)
        window = window[1:] + (data.next(),)

Now, we simply replace the call to day_grouper in Example 5-5 with a call to rolling_window_grouper and we get the desired result. In this version we also see very explicitly the memory guarantee of this and the previous method: we only ever hold one group’s worth of data as state (for the rolling version, window_size data points; 3,600 is the default, and 86,400 would cover a full day of per-second data). We can change this by opening the file multiple times and using the various file descriptors to point to the exact data we like (or by using the linecache module). However, this is only necessary if this subsampling of the dataset still doesn’t fit in memory.
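Assuming the imports and variables from Example 5-5, the swap itself is a one-line change (window_size=86400 is an illustrative value covering one day of per-second samples):

data = read_data(data_filename)
data_day = rolling_window_grouper(data, window_size=86400)   # one-day rolling window
anomalous_dates = ifilter(None, imap(check_anomaly, data_day))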

A final note: in the rolling_window_grouper function, we perform many pop and append operations on the list window. We can greatly optimize this by using the deque object in the collections module. This object gives us O(1) appends and removals to and from the beginning or end of a list (while normal lists are O(1) for appends or removals to/from the end of the list and O(n) for the same operations at the beginning of the list). Using the deque object, we can append the new data to the right (or end) of the list and use deque.popleft() to delete data from the left (or beginning) of the list without having to allocate more space or perform long O(n) operations.
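Here is one way the rolling grouper could be reworked around a deque; this is a sketch for illustration, not the version from the text:

from collections import deque
from datetime import datetime
from itertools import islice

def rolling_window_grouper_deque(data, window_size=3600):
    window = deque(islice(data, 0, window_size))
    while True:
        current_datetime = datetime.fromtimestamp(window[0][0])
        # The consumer must finish with 'window' before requesting the next group,
        # since we mutate it in place below
        yield (current_datetime, window)
        window.popleft()             # O(1) removal from the left
        window.append(data.next())   # O(1) append on the right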

Wrap-Up

By formulating our anomaly-finding algorithm with iterators, we are able to process much more data than could fit into memory. What’s more, we are able to do it faster than if we had used lists, since we avoid all the costly append operations.

Since iterators are a primitive type in Python, this should always be a go-to method for trying to reduce the memory footprint of an application. The benefits are that results are lazily evaluated, so you only ever process the data you need, and memory is saved since we don’t store previous results unless explicitly required to. In Chapter 11, we will talk about other methods that can be used for more specific problems and introduce some new ways of looking at problems when RAM is an issue.

Another benefit of solving problems using iterators is that it prepares your code to be used on multiple CPUs or multiple computers, as we will see in Chapters 9 and 10. As we discussed in Iterators for Infinite Series, when working with iterators you must always think about the various states that are necessary for your algorithm to work. Once you figure out how to package the state necessary for the algorithm to run, it doesn’t matter where it runs. We can see this sort of paradigm, for example, with the multiprocessing and ipython modules, both of which use a map-like function to launch parallel tasks.


[8] We use Knuth’s online mean algorithm. This lets us calculate the mean and the second moment (and from it the standard deviation) in a single pass, keeping only a small, constant amount of running state. We could also calculate higher moments by modifying the equations slightly and adding more state variables (one per moment). More information can be found at http://www.johndcook.com/standard_deviation.html.
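For reference, the per-sample update implemented in Example 5-4 is, with $x_n$ the new value, $\mu_n$ the running mean, and $M_{2,n}$ the running sum of squared deviations:

$$\delta = x_n - \mu_{n-1}, \qquad \mu_n = \mu_{n-1} + \frac{\delta}{n}, \qquad M_{2,n} = M_{2,n-1} + \delta\,(x_n - \mu_n), \qquad \sigma^2 \approx \frac{M_{2,n}}{n-1}$$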