Cython (2015)

Chapter 4. Cython in Practice: N-Body Simulation

The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures.

— F. Brooks

This chapter applies the Cython fundamentals discussed in Chapter 3 to a straightforward but nontrivial example using what we have covered so far. The example starts with a pure-Python N-body simulator to model the solar system, and converts the performance-critical components to use Cython constructs. It comes from the widely known computer language benchmarks game, allowing comparison between the pure-Python, Cython, and C implementations of the same program.

This chapter will give us a better understanding of how Cython is used in practice. The pure-C, pure-Python, and converted Cython versions can be found in the example code repository. Interested readers can follow along with the entire example using this resource.

Overview of the N-Body Python Code

The Python N-body code evolves the positions and velocities of the four Jovian planets in a heliocentric coordinate system. Such a system is chaotic, meaning that the long-term evolution of the system is very sensitive to the initial positions and velocities of all bodies. Small perturbations in the initial conditions lead to arbitrarily diverging results, making prediction difficult. When we are simulating a chaotic system, it is important that the algorithm, or integrator, be highly accurate. For this reason the N-body code uses a symplectic integrator, which is a fancy term for a time-stepping scheme that does a really good job of computing the right trajectories.

The time step and the initial positions, velocities, and masses of the Jovian planets are given. By passing in a command-line argument, we can vary the number of time steps the integrator takes.

The main routine is straightforward. It takes the number of steps to integrate (n) the initial conditions of the celestial bodies to integrate, and a reference body (in this case, the Sun):

def main(n, bodies=BODIES, ref='sun'):

# ...

It first gets a list of all the bodies and makes pairs of all of them for convenience, as many functions need to iterate over all unique pairs:

# ...

system = list(bodies.values())

pairs = combinations(system)

It then calls offset_momentum to correct the Sun’s momentum so that it stays at the system’s center of mass:

# ...

offset_momentum(bodies[ref], system)

Before running the integrator, main first calls report_energy to compute and print the system’s total energy:

# ...

report_energy(system, pairs)

Symplectic integrators are very good at conserving energy, and we will use energy conservation as a way to test the accuracy of the integrator.

After getting the initial energy, we then call advance, the core of the computation, and pass in the time step, the number of steps to take, and the sequence of paired bodies:

# ...

advance(0.01, n, system, pairs)

For this simulation, the unit of time is the mean solar day, the unit of distance is one astronomical unit, and the unit of mass is the solar mass.

After advancing the system, we output the total energy again:

# ...

report_energy(system, pairs)

Its value should be close to the total energy computed before advance was called.

Let’s try it out from the command line:

$ time python nbody.py 500000

-0.169075164

-0.169096567

python nbody.py 500000 13.21s user 0.04s system 99% cpu 13.286 total

The energy before and after match to nearly five decimal places.

This pure-Python version requires about 13 seconds to advance 500,000 steps. When all is said and done, Cython will improve performance by nearly two orders of magnitude, approaching the performance of a pure-C version of the same algorithm.

Converting to Cython

Let’s first run our pure-Python version under cProfile to quantify where the runtime is spent:

$ ipython --no-banner

In [1]: %run -p nbody.py 500000

71 function calls in 13.897 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)

1 13.880 13.880 13.896 13.896 nbody.py:59(advance)

2 0.015 0.008 0.015 0.008 {range}

1 0.001 0.001 13.897 13.897 {execfile}

2 0.000 0.000 0.000 0.000 nbody.py:82(report_energy)

...

It is not surprising to find that advance consumes 99.9 percent of the runtime. Converting it to use static types and more efficient data structures is the right approach. The rest of the code can remain as is.

Before we begin converting our code to Cython, we first copy the nbody.py file to nbody.pyx, which allows us to use Cython-specific declarations and constructs.

Let’s compile and run the Cython version to ensure the program works correctly. To compile, we use a simple distuils script named setup.py:

from distutils.core import setup

from Cython.Build import cythonize

setup(name="nbody",

ext_modules=cythonize("nbody.pyx"))

We need a run_nbody.py driver script to run the main function inside our nbody extension module:

import sys

from nbody import main

main(int(sys.argv[1]))

Building our extension is straightforward:

$ python setup.py build_ext -i

(Consult Chapter 2 for platform-specific compilation instructions.)

After compiling our extension, we can test that we obtain the same results as before:

$ time python run_nbody.py 500000

-0.169075164

-0.169096567

python run_nbody.py 500000 4.78s user 0.03s system 99% cpu 4.821 total

The output is identical to the pure-Python version’s, and the performance already improved by a factor of 2.8. Cython provides this performance improvement essentially for free.

With our compilation infrastructure in place, we can turn our attention to improving performance further still.

Python Data Structures and Organization

In Python, each celestial body is represented as a tuple with three elements: two three-element lists for the position and velocity, and a float value for the mass. For example, the Sun’s initial condition is represented by the following three-element tuple:

([0.0, 0.0, 0.0], # position

[0.0, 0.0, 0.0], # velocity

SOLAR_MASS # mass

)

And Jupiter’s is:

([ 4.84143144246472090e+00,

-1.16032004402742839e+00,

-1.03622044471123109e-01],

[ 1.66007664274403694e-03 * DAYS_PER_YEAR,

7.69901118419740425e-03 * DAYS_PER_YEAR,

-6.90460016972063023e-05 * DAYS_PER_YEAR],

9.54791938424326609e-04 * SOLAR_MASS),

The global constants DAYS_PER_YEAR and SOLAR_MASS are defined normalization parameters.

The system variable is a list of these tuples, and pairs is a list of all pairs of these tuples. The simulation will access and update the positions and velocities of all planets frequently, so optimizing their representation is essential.

The advance function loops over all steps, and for each step, loops over all pairs of bodies:

def advance(dt, n, bodies, pairs):

for i inrange(n):

for (([x1, y1, z1], v1, m1),

([x2, y2, z2], v2, m2)) inpairs:

# ...update velocities...

Here we use tuple unpacking to extract the positions (x1, x2, y1, y2, etc.), the velocity lists v1 and v2, and the masses m1 and m2 from each pair in pairs. The body of the loop updates the velocities according to the symplectic integration algorithm.

Once the velocities are updated, we update the positions:

for (r, [vx, vy, vz], m) inbodies:

r[0] += dt * vx

r[1] += dt * vy

r[2] += dt * vz

The bodies and pairs sequences are set up to refer to the same objects, so updating the velocities in the first loop allows us to update the positions in the second, even though we are looping over different sequences.

Converting Data Structures to structs

Our strategy to improve performance is to convert the pure-Python list-of-tuples-of-lists-of-floats into a C array of C structs. With the C version, accessing and updating the planet’s data will have much better performance, as these operations will use fast C iteration and optimized lookups, rather than the general (and slow) iteration and lookups we know to expect from the Python interpreter.

Let’s define a struct, body_t, that has two double arrays for the body’s position and velocity, and a single double for its mass:

cdef struct body_t:

double x[3]

double v[3]

double m

We place this struct definition toward the top of nbody.pyx.

Another goal is to leave most of the nbody.py code unmodified, and use our body_t struct only where performance matters.

The advance function needs to convert the Python list of tuples of celestial body data into a C array of body_t elements. Let’s make a cdef function pair to convert between Python and C data types.

First, make_cbodies converts a Python list of tuples into a C array of body_t structs. It takes a bodies Python list and a preallocated C array of body_ts:

cdef void make_cbodies(list bodies, body_t *cbodies)

The implementation simply loops over the bodies list and initializes the preallocated cbodies array with the Python list’s data:

cdef void make_cbodies(list bodies, body_t *cbodies, int num_cbodies):

cdef body_t *cbody

for i, body inenumerate(bodies):

if i >= num_cbodies:

break

(x, v, m) = body

cbody = &cbodies[i]

cbody.x[0], cbody.x[1], cbody.x[2] = x

cbody.v[0], cbody.v[1], cbody.v[2] = v

cbodies[i].m = m

Its complement, make_pybodies, converts a body_t array into a Python list of tuples:

cdef list make_pybodies(body_t *cbodies, int num_cbodies):

pybodies = []

for i inrange(num_cbodies):

x = [cbodies[i].x[0], cbodies[i].x[1], cbodies[i].x[2]]

v = [cbodies[i].v[0], cbodies[i].v[1], cbodies[i].v[2]]

pybodies.append((x, v, cbodies[i].m))

return pybodies

Now we are ready to convert the for loops in advance to use static types. First, consider the original loop body:

def advance(dt, n, bodies, pairs):

# ...

for (([x1, y1, z1], v1, m1),

([x2, y2, z2], v2, m2)) inpairs:

dx = x1 - x2

dy = y1 - y2

dz = z1 - z2

mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))

b1m = m1 * mag

b2m = m2 * mag

v1[0] -= dx * b2m

v1[1] -= dy * b2m

v1[2] -= dz * b2m

v2[0] += dx * b1m

v2[1] += dy * b1m

v2[2] += dz * b1m

The Cython version is as follows:

def advance(double dt, int n, bodies):

cdef:

int i, ii, jj

double dx, dy, dz, mag, b1m, b2m

body_t *body1

body_t *body2

body_t cbodies[NBODIES]

make_cbodies(bodies, cbodies, NBODIES)

for i inrange(n):

for ii inrange(NBODIES-1):

body1 = &cbodies[ii]

for jj inrange(ii+1, NBODIES):

body2 = &cbodies[jj]

dx = body1.x[0] - body2.x[0]

dy = body1.x[1] - body2.x[1]

dz = body1.x[2] - body2.x[2]

mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))

b1m = body1.m * mag

b2m = body2.m * mag

body1.v[0] -= dx * b2m

body1.v[1] -= dy * b2m

body1.v[2] -= dz * b2m

body2.v[0] += dx * b1m

body2.v[1] += dy * b1m

body2.v[2] += dz * b1m

for ii inrange(NBODIES):

body2 = &cbodies[ii]

body2.x[0] += dt * body2.v[0]

body2.x[1] += dt * body2.v[1]

body2.x[2] += dt * body2.v[2]

return make_pybodies(cbodies, NBODIES)

We convert the for loop over pairs into nested for loops over indices into the C array of body_t structs. We use two body_t pointers to refer to the current bodies in the pair.

We removed the pairs argument to advance, so we need to update main to reflect this change, but we will not show the modification here.

Running the Cythonized Version

After recompiling our code, we can run our latest Cython version and see how it compares to the Python version:

$ time python run_nbody.py 500000

-0.169075164

-0.169096567

python run_nbody.py 500000 0.54s user 0.01s system 99% cpu 0.550 total

Our Cython version takes about 0.4 seconds to run, and the energy values are in agreement. This is about 25 times faster than the pure Python version.

We can compare this to the runtime of a serial hand-written C version obtained from the computer language benchmarks game, which we compile with equivalent optimization flags:

$ time ./nbody.x 500000

-0.169075164

-0.169096567

./nbody.x 500000 0.14s user 0.00s system 97% cpu 0.150 total

Our performance thus far is within a factor of four of the C version.

A quick comparison of the C version’s advance function and our version reveals one important difference when the distance is computed—the C version uses sqrt:

double inv_distance = 1.0 / sqrt(dx * dx + dy * dy + dz * dz);

double mag = inv_distance * inv_distance * inv_distance;

while our version uses the ** operator, which Cython translates to pow:

mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))

It is straightforward to convert our version to use sqrt:

ds = dx * dx + dy * dy + dz * dz

mag = dt / (ds * sqrt(ds))

This requires that we type ds as a double and add a cimport line at the top of the file (Chapter 6):

from libc.math cimport sqrt

With this minor syntactic change, we see another significant performance boost:

$ time python ./run_nbody.py 500000

-0.169075164

-0.169096567

python ./run_nbody.py 500000 0.15s user 0.01s system 99% cpu 0.159 total

This last improvement yields code that is a factor of 3.6 faster than the previous version, is a factor of 90 faster than the pure-Python version, and brings us within a factor of 1.25 of the pure-C version’s performance.

Summary

This chapter demonstrates how to take numeric-heavy Python code and convert it to Cython, achieving a factor-of-90 boost in performance. The approach we used is straightforward and ensures that we get the most payoff for our efforts.

The steps we followed are:

1. Profile the pure-Python version (using the cProfile module or IPython’s %run -p magic command) to determine where the code spends its time. In this example, nearly all the runtime is spent in the loop-heavy advance function.

2. Inspect the hotspots for nested for loops, numeric-heavy operations, and nested Python containers, all of which can be easily converted with Cython to use more efficient C-level constructs. This example happens to have all of the above.

3. Use Cython to declare C data structures equivalent to the Python data structures identified above. Create converters (if necessary) to transform Python data to C data. In the N-body simulation, we created a body_t struct to represent the nested list-of-tuples-of-lists-of-floats Python data in C, which has better data locality and significantly more efficient access. We also created two converters, make_cbodies and make_pybodies, to convert Python to C and C to Python, respectively. Sometimes these converters are not necessary if Cython can convert the data automatically.

4. Convert the hotspots to use our C-level data structures. Remove Python data structures from nested loops to the extent possible. Ensure all variables used in nested loops (including the loop variables themselves) are statically typed. Our make_pybodies and make_cbodiesconverters, coupled with plenty of cdef declarations, were sufficient in this example.

5. Test the code to ensure the modifications have not changed the semantics. Profile again. If performance is not satisfactory, use Cython profiling tools (Chapter 9) to draw attention to inefficient code.

6. Repeat as necessary.

Another goal of this chapter was to show how to use the components covered in Chapter 3 in a realistic setting. Remembering the Pareto principle (or the 80/20 rule) is useful: we need only use Cython in the 20 percent of the code that occupies 80 percent (or more) of the runtime. The other 80 percent of the code can (and should) remain unmodified.

Studying this example end-to-end is a good exercise for the Cython newcomer; understanding it fully will solidify many core concepts and techniques useful for any Cython project.