Chapter 3. Working with Datasets

Datasets are the central feature of HDF5. You can think of them as NumPy arrays that live on disk. Every dataset in HDF5 has a name, a type, and a shape, and supports random access. Unlike the built-in np.save and friends, there’s no need to read and write the entire array as a block; you can use the standard NumPy syntax for slicing to read and write just the parts you want.

Dataset Basics

First, let’s create a file so we have somewhere to store our datasets:

>>> import numpy as np

>>> import h5py

>>> f = h5py.File("testfile.hdf5")

Every dataset in an HDF5 file has a name. Let’s see what happens if we just assign a new NumPy array to a name in the file:

>>> arr = np.ones((5,2))

>>> f["my dataset"] = arr

>>> dset = f["my dataset"]

>>> dset

<HDF5 dataset "my dataset": shape (5, 2), type "<f8">

We put in a NumPy array but got back something else: an instance of the class h5py.Dataset. This is a “proxy” object that lets you read and write to the underlying HDF5 dataset on disk.

Type and Shape

Let’s explore the Dataset object. If you’re using IPython, type dset. and hit Tab to see the object’s attributes; otherwise, do dir(dset). There are a lot, but a few stand out:

>>> dset.dtype

dtype('float64')

Each dataset has a fixed type that is defined when it’s created and can never be changed. HDF5 has a vast, expressive type mechanism that can easily handle the built-in NumPy types, with few exceptions. For this reason, h5py always expresses the type of a dataset using standard NumPy dtype objects.

There’s another familiar attribute:

>>> dset.shape

(5, 2)

A dataset’s shape is also defined when it’s created, although as we’ll see later, it can be changed. Like NumPy arrays, HDF5 datasets can have between zero axes (scalar, shape ()) and 32 axes. Dataset axes can be up to 2^63 - 1 elements long.

Reading and Writing

Datasets wouldn’t be that useful if we couldn’t get at the underlying data. First, let’s see what happens if we just read the entire dataset:

>>> out = dset[...]

>>> out

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

>>> type(out)

<type 'numpy.ndarray'>

Slicing into a Dataset object returns a NumPy array. Keep in mind what’s actually happening when you do this: h5py translates your slicing selection into a portion of the dataset and has HDF5 read the data from disk. In other words, ignoring caching, a slicing operation results in a read or write to disk.

Let’s try updating just a portion of the dataset:

>>> dset[1:4,1] = 2.0

>>> dset[...]

array([[ 1.,  1.],
       [ 1.,  2.],
       [ 1.,  2.],
       [ 1.,  2.],
       [ 1.,  1.]])

Success!

NOTE

Because Dataset objects are so similar to NumPy arrays, you may be tempted to mix them in with computational code. This may work for a while, but generally causes performance problems as the data is on disk instead of in memory.
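
As a rule of thumb, pull the data you need into memory once, compute there, and write the result back in a single operation. Here’s a minimal sketch of that pattern, using the (5, 2) dataset from above; the square-root computation is just a placeholder:

>>> data = dset[...]              # one read from disk into a NumPy array

>>> result = np.sqrt(data) + 1    # all computation happens in memory

>>> dset[...] = result            # one write back to disk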

Creating Empty Datasets

You don’t need to have a NumPy array at the ready to create a dataset. The method create_dataset on our File object can create empty datasets from a shape and type, or even just a shape (in which case the type will be np.float32, native single-precision float):

>>> dset = f.create_dataset("test1", (10, 10))

>>> dset

<HDF5 dataset "test1": shape (10, 10), type "<f4">

>>> dset = f.create_dataset("test2", (10, 10), dtype=np.complex64)

>>> dset

<HDF5 dataset "test2": shape (10, 10), type "<c8">

HDF5 is smart enough to only allocate as much space on disk as it actually needs to store the data you write. Here’s an example: suppose you want to create a 1D dataset that can hold 4 gigabytes worth of data samples from a long-running experiment:

>>> dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32)

Now write some data to it. To be fair, we also ask HDF5 to flush its buffers and actually write to disk:

>>> dset[0:1024] = np.arange(1024)

>>> f.flush()

Looking at the file size on disk:

$ ls -lh testfile.hdf5

-rw-r--r-- 1 computer computer 66K Mar 6 21:23 testfile.hdf5

Saving Space with Explicit Storage Types

When it comes to types, a few seconds of thought can save you a lot of disk space and also reduce your I/O time. The create_dataset method can accept almost any NumPy dtype for the underlying dataset, and crucially, it doesn’t have to exactly match the type of data you later write to the dataset.

Here’s an example: one common use for HDF5 is to store numerical floating-point data—for example, time series from digitizers, stock prices, computer simulations—anywhere it’s necessary to represent “real-world” numbers that aren’t integers.

Often, to keep the accuracy of calculations high, 8-byte double-precision numbers will be used in memory (NumPy dtype float64), to minimize the effects of rounding error. However, it’s common practice to store these data points on disk as single-precision, 4-byte numbers (float32), saving a factor of 2 in file size.

Let’s suppose we have such a NumPy array called bigdata:

>>> bigdata = np.ones((100,1000))

>>> bigdata.dtype

dtype('float64')

>>> bigdata.shape

(100, 1000)

We could store this in a file by simple assignment, resulting in a double-precision dataset:

>>> with h5py.File('big1.hdf5','w') as f1:
...     f1['big'] = bigdata

$ ls -lh big1.hdf5

-rw-r--r-- 1 computer computer 784K Apr 13 14:40 big1.hdf5

Or we could request that HDF5 store it as single-precision data:

>>> with h5py.File('big2.hdf5','w') as f2:
...     f2.create_dataset('big', data=bigdata, dtype=np.float32)

$ ls -lh big2.hdf5

-rw-r--r-- 1 computer computer 393K Apr 13 14:42 big2.hdf5

Keep in mind that whichever one you choose, your data will emerge from the file in that format:

>>> f1 = h5py.File("big1.hdf5")

>>> f2 = h5py.File("big2.hdf5")

>>> f1['big'].dtype

dtype('float64')

>>> f2['big'].dtype

dtype('float32')

Automatic Type Conversion and Direct Reads

But exactly how and when does the data get converted between the double-precision float64 in memory and the single-precision float32 in the file? This question is important for performance; after all, if you have a dataset that takes up 90% of the memory in your computer and you need to make a copy before storing it, there are going to be problems.

The HDF5 library itself handles type conversion, and does it on the fly when saving to or reading from a file. Nothing happens at the Python level; your array goes in, and the appropriate bytes come out on disk. There are built-in routines to convert between many source and destination formats, including between all flavors of floating-point and integer numbers available in NumPy.
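
For instance, you can assign a double-precision NumPy array directly to a single-precision dataset and let HDF5 convert as the data is written. A quick sketch (the dataset name "converted" is chosen just for illustration):

>>> dset32 = f.create_dataset('converted', (1000,), dtype=np.float32)

>>> dset32[...] = np.arange(1000, dtype=np.float64)   # float64 in memory

>>> dset32.dtype                                      # stored as float32 on disk

dtype('float32')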

But what if we want to go the other direction? Suppose we have a single-precision float dataset on disk, and want to read it in as double precision? There are a couple of reasons this might be useful. The result might be very large, and we might not have the memory space to hold both single- and double-precision versions in Python while we do the conversion. Or we might want to do the type conversion on the fly while reading from disk, to reduce the application’s runtime.

For big arrays, the best approach is to read directly into a preallocated NumPy array of the desired type. Let’s say we have the single-precision dataset from the previous example, and we want to read it in as double precision:

>>> dset = f2['big']

>>> dset.dtype

dtype('float32')

>>> dset.shape

(100, 1000)

We allocate our new double-precision array on the Python side:

>>> big_out = np.empty((100, 1000), dtype=np.float64)

Here np.empty creates the array, but unlike np.zeros or np.ones it doesn’t bother to initialize the array elements. Now we request that HDF5 read directly into our output array:

>>> dset.read_direct(big_out)

That’s it! HDF5 fills up the empty array with the requested data. No extra arrays or time spent converting.

NOTE

When using read_direct, you don’t always have to read the whole dataset. See Reading Directly into an Existing Array for details.

Reading with astype

You may not always want to go through the whole rigamarole of creating a destination array and passing it to read_direct. Another way to tell HDF5 what type you want is by using the Dataset.astype context manager.

Let’s suppose we want to read the first 1000 elements of our “big” dataset in the previous example, and have HDF5 itself convert them from single to double precision:

>>> with dset.astype('float64'):
...     out = dset[0,:]

>>> out.dtype

dtype('float64')

Finally, here are some tips to keep in mind when using HDF5’s automatic type conversion. They apply both to reads with read_direct or astype and also to writing data from NumPy into existing datasets:

1. Generally, you can only convert between types of the same “flavor.” For example, you can convert integers to floats, and floats to other floats, but not strings to floats or integers. You’ll get an unhelpful-looking IOError if you try.

2. When you’re converting to a “smaller” type (float64 to float32, or "S10" to "S5"), HDF5 will truncate or “clip” the values:

>>> f.create_dataset('x', data=1e256, dtype=np.float64)

>>> print f['x'][...]

1e+256

>>> f.create_dataset('y', data=1e256, dtype=np.float32)

>>> print f['y'][...]

inf

There’s no warning when this happens, so it’s in your interest to keep track of the types involved.
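
If you’re worried about silent clipping, one defensive option is to check your data against the limits of the destination type before writing; a small sketch:

>>> values = np.array([1e256])                         # hypothetical data to store

>>> np.any(np.abs(values) > np.finfo(np.float32).max)  # would this clip to inf?

True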

Reshaping an Existing Array

There’s one more trick up our sleeve with create_dataset, although this one’s a little more esoteric. You’ll recall that it takes a “shape” argument as well as a dtype argument. As long as the total number of elements matches, you can specify a shape different from the shape of your input array.

Let’s suppose we have an array that stores 100 640×480-pixel images, stored as 640-element “scanlines”:

>>> imagedata.shape

(100, 480, 640)

Now suppose that we want to store each image as a “top half” and “bottom half” without needing to do the slicing when we read. When we go to create the dataset, we simply specify the new shape:

>>> f.create_dataset('newshape', data=imagedata, shape=(100, 2, 240, 640))

There’s no performance penalty. Like the built-in np.reshape, only the indices are shuffled around.
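
As a quick sanity check (assuming the imagedata array above), the top half of the first image should now live at index [0, 0, ...] of the new dataset:

>>> np.array_equal(f['newshape'][0, 0, :, :], imagedata[0, 0:240, :])

True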

Fill Values

If you create a brand-new dataset, you’ll notice that by default it’s zero filled:

>>> dset = f.create_dataset('empty', (2,2), dtype=np.int32)

>>> dset[...]

array([[0, 0],
       [0, 0]])

For some applications, it’s nice to pick a default value other than 0. You might want to set unmodified elements to -1, or even NaN for floating-point datasets.

HDF5 addresses this with a fill value, which is the value returned for the areas of a dataset that haven’t been written to. Fill values are handled when data is read, so they don’t cost you anything in terms of storage space. They’re defined when the dataset is created, and can’t be changed:

>>> dset = f.create_dataset('filled', (2,2), dtype=np.int32, fillvalue=42)

>>> dset[...]

array([[42, 42],
       [42, 42]])

A dataset’s fill value is available on the fillvalue property:

>>> dset.fillvalue

42

Reading and Writing Data

Your main day-to-day interaction with Dataset objects will look a lot like your interactions with NumPy arrays. One of the design goals for the h5py package was to “recycle” as many NumPy metaphors as possible for datasets, so that you can interact with them in a familiar way.

Even if you’re an experienced NumPy user, don’t skip this section! There are important performance differences and implementation subtleties between the two that may trip you up.

Before we dive into the nuts and bolts of reading from and writing to datasets, it’s important to spend a few minutes discussing how Dataset objects aren’t like NumPy arrays, especially from a performance perspective.

Using Slicing Effectively

In order to use Dataset objects efficiently, we have to know a little about what goes on behind the scenes. Let’s take the example of reading from an existing dataset. Suppose we have the (100, 1000)-shape array from the previous example:

>>> dset = f2['big']

>>> dset

<HDF5 dataset "big": shape (100, 1000), type "<f4">

Now we request a slice:

>>> out = dset[0:10, 20:70]

>>> out.shape

(10, 50)

Here’s what happens behind the scenes when we do the slicing operation:

1. h5py figures out the shape (10, 50) of the resulting array object.

2. An empty NumPy array is allocated of shape (10, 50).

3. HDF5 selects the appropriate part of the dataset.

4. HDF5 copies data from the dataset into the empty NumPy array.

5. The newly filled in NumPy array is returned.

You’ll notice that this implies a certain amount of overhead. Not only do we create a new NumPy array for each slice requested, but we have to figure out what size the array object should be, check that the selection falls within the bounds of the dataset, and have HDF5 perform the selection, all before we’ve read a single byte of data.

This leads us to the first and most important performance tip when using datasets: take reasonably sized slices.

Here’s an example: using our (100, 1000)-shape dataset, which of the following do you think is likely to be faster?

# Check for negative values and clip to 0
for ix in xrange(100):
    for iy in xrange(1000):
        val = dset[ix,iy]                # Read one element
        if val < 0: dset[ix,iy] = 0      # Clip to 0 if needed

or

# Check for negative values and clip to 0
for ix in xrange(100):
    val = dset[ix,:]        # Read one row
    val[ val < 0 ] = 0      # Clip negative values to 0
    dset[ix,:] = val        # Write row back out

In the first case, we perform 100,000 slicing operations. In the second, we perform only 100.

This may seem like a trivial example, but the first pattern creeps into real-world code frequently; with fast in-memory slicing on NumPy arrays, it is actually reasonably quick on modern machines. But once you start going through the whole slice-allocate-HDF5-read pipeline outlined here, things start to bog down.

The same applies to writing, although fewer steps are involved. When you perform a write operation, for example:

>>> some_dset[0:10, 20:70] = out*2

The following steps take place:

1. h5py figures out the size of the selection, and determines whether it is compatible with the size of the array being assigned.

2. HDF5 makes an appropriately sized selection on the dataset.

3. HDF5 reads from the input array and writes to the file.

All of the overhead involved in figuring out the slice sizes and so on still applies. Writing to a dataset one element at a time, or even a few elements at a time, is a great way to get poor performance.
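
If your data arrives one value at a time, a common workaround is to accumulate it in an in-memory NumPy buffer and write it with a single slicing assignment. Here’s a sketch, where outdset is a hypothetical 1D dataset with at least 1,000 elements and compute_sample is a stand-in for whatever produces each value:

buf = np.empty((1000,), dtype=outdset.dtype)
for i in xrange(1000):
    buf[i] = compute_sample(i)   # hypothetical per-element computation
outdset[0:1000] = buf            # one slicing write instead of 1,000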

Start-Stop-Step Indexing

h5py uses a subset of the plain-vanilla slicing available in NumPy. This is the most familiar form, consisting of up to three indices providing a start, stop, and step.

For example, let’s create a simple 10-element dataset with increasing values:

>>> dset = f.create_dataset('range', data=np.arange(10))

>>> dset[...]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

One index picks a particular element:

>>> dset[4]

4

Two indices specify a range, ending just before the last index:

>>> dset[4:8]

array([4,5,6,7])

Three indices provide a “step,” or pitch, telling how many elements to skip:

>>> dset[4:8:2]

array([4,6])

And of course you can get all the points by simply providing :, like this:

>>> dset[:]

array([0,1,2,3,4,5,6,7,8,9])

Like NumPy, you are allowed to use negative numbers to “count back” from the end of the dataset, with -1 referring to the last element:

>>> dset[4:-1]

array([4,5,6,7,8])

Unlike NumPy, you can’t pull fancy tricks with the indices. For example, the traditional way to reverse an array in NumPy is this:

>>> a = np.arange(10)

>>> a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> a[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

But if you try it on a dataset, you get the following:

>>> dset[::-1]

ValueError: Step must be >= 1 (got -1)
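
If you do need a reversed copy, read the data into memory first and let NumPy do the work:

>>> dset[...][::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])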

Multidimensional and Scalar Slicing

By now you’ve gotten used to seeing the expression “...,” which is used as a slice in examples. This object has the built-in name Ellipsis in the Python world. You can use it as a “stand-in” for axes you don’t want to bother specifying:

>>> dset = f.create_dataset('4d', shape=(100, 80, 50, 20))

>>> dset[0,...,0].shape

(80, 50)

And of course you can get the entire contents by using Ellipsis by itself:

>>> dset[...].shape

(100, 80, 50, 20)

There is one oddity we should discuss, which is that of scalar datasets. In NumPy, there are two flavors of array containing one element. The first has shape (1,), and is an ordinary one-dimensional array. You can get at the value inside by slicing or simple indexing:

>>> dset = f.create_dataset('1d', shape=(1,), data=42)

>>> dset.shape

(1,)

>>> dset[0]

42

>>> dset[...]

array([42])

Note, by the way, how using Ellipsis provides an array with one element, whereas integer indexing provides the element itself.

The second flavor has shape () (an empty tuple) and can’t be accessed by indexing:

>>> dset = f.create_dataset('0d', data=42)

>>> dset.shape

()

>>> dset[0]

ValueError: Illegal slicing argument for scalar dataspace

>>> dset[...]

array(42)

Note how using Ellipsis has again returned an array, in this case a scalar array.

How can we get the value itself, without it being wrapped in a NumPy array? It turns out there’s another way to slice into NumPy arrays (and Dataset objects). You can index with a somewhat bizarre-looking empty tuple:

>>> dset[()]

42

So keep these in your toolkit:

1. Using Ellipsis gives you all the elements in the dataset (always as an array object).

2. Using an empty tuple "()" gives you all the elements in the dataset, as an array object for 1D and higher datasets, and as a scalar element for 0D datasets.

NOTE

To make things even more confusing, you may see code in the wild that uses the .value attribute of a dataset. This is a historical wart that is exactly equivalent to doing dataset[()]. It’s long deprecated and not available in modern versions of h5py.

Boolean Indexing

In an earlier example, we used an interesting expression to set negative entries in a NumPy array val to zero:

>>> val[ val < 0 ] = 0

This is an idiom in NumPy made possible by Boolean-array indexing. If val is a NumPy array of integers, then the result of the expression val < 0 is an array of Booleans. Its entries are True where the corresponding elements of val are negative, and False elsewhere. In the NumPy world, this is also known as a mask.

Crucially, in both the NumPy and HDF5 worlds, you can use a Boolean array as an indexing expression. This does just what you’d expect; it selects the dataset elements where the corresponding index entries are True, and de-selects the rest.

In the spirit of the previous example, let’s suppose we have a dataset initialized to a set of random numbers distributed between -1 and 1:

>>> data = np.random.random(10)*2 - 1

>>> data

array([ 0.98885498, -0.28554781, -0.17157685, -0.05227003,  0.66211931,
        0.45692186,  0.07123649, -0.40374417,  0.22059144, -0.82367672])

>>> dset = f.create_dataset('random', data=data)

Let’s clip the negative values to 0, by using a Boolean array:

>>> dset[data<0] = 0

>>> dset[...]

array([ 0.98885498,  0.        ,  0.        ,  0.        ,  0.66211931,
        0.45692186,  0.07123649,  0.        ,  0.22059144,  0.        ])

On the HDF5 side, this is handled by transforming the Boolean array into a list of coordinates in the dataset. This has a couple of consequences.

First, for very large indexing expressions with lots of True values, it may be faster to modify the data on the Python side and write the entire dataset back out. If you suspect a slowdown, it’s a good idea to test this.
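
Here’s what that in-memory alternative looks like as a sketch, reproducing the clipping operation on the small “random” dataset entirely in memory; for a large dataset you’d want to time it to confirm it actually wins:

>>> tmp = dset[...]          # read everything into memory

>>> tmp[tmp < 0] = 0         # apply the mask with NumPy

>>> dset[...] = tmp          # write the whole array back in one go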

Second, the expression on the right-hand side has to be either a scalar, or an array with exactly the right number of points. This isn’t quite as burdensome a requirement as it might seem. If the number of elements that meet the criteria is small, it’s actually a very effective way to “update” the dataset.

For example, what if instead of clipping the negative values to zero, we wanted to flip them and make them positive? We could modify the original array and write the entire thing back out to disk. Or, we could modify just the elements we want:

>>> dset[data<0] = -1*data[data<0]

>>> dset[...]

array([ 0.98885498,  0.28554781,  0.17157685,  0.05227003,  0.66211931,
        0.45692186,  0.07123649,  0.40374417,  0.22059144,  0.82367672])

Note that the number of elements (five, in this case) is the same on the left- and right-hand sides of the preceding assignment.

Coordinate Lists

There’s another feature borrowed from NumPy, with a few modifications. When slicing into a dataset, for any axis, instead of an x:y:z-style slicing expression you can supply a list of indices. Let’s use our 10-element range dataset again:

>>> dset = f['range']

>>> dset[...]

array([0,1,2,3,4,5,6,7,8,9])

Suppose we wanted just elements 1, 2, and 7. We could manually extract them one at a time as dset[1], dset[2], and dset[7]. We could also use a Boolean indexing array with its values set to True at locations 1, 2, and 7.

Or, we could simply specify the elements desired using a list:

>>> dset[ [1,2,7] ]

array([1,2,7])

This may seem trivial, but it’s implemented in a way that is much more efficient than Boolean masking for large datasets. Instead of generating a laundry list of coordinates to access, h5py breaks the selection down into contiguous “subselections,” which are much faster when multiple axes are involved.

NOTE

If you’re searching for documentation on the exotic NumPy methods of accessing an array, they are collectively called fancy indexing.

Of course, the trade-off is that there are a few differences from the native NumPy coordinate-list slicing approach, which rule out some of the fancier tricks from the NumPy world:

1. Only one axis at a time can be sliced with a list.

2. Repeated elements are not allowed.

3. Indices in the list must be given in increasing order.
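
In practice, these restrictions are easy to satisfy by cleaning up your index list on the Python side first. A sketch, using the range dataset again (the unsorted list with a repeat is hypothetical):

>>> wanted = [7, 2, 1, 2]            # unsorted, with a duplicate

>>> dset[ sorted(set(wanted)) ]      # dedupe and sort before slicing

array([1, 2, 7])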

Automatic Broadcasting

In a couple of examples so far, we’ve made slicing assignments in which the number of elements on the left- and right-hand sides were not equal. For example, in the Boolean array example:

>>> dset[data<0] = 0

This kind of expression is handled by broadcasting, similar to NumPy’s own broadcasting rules for arrays. Used judiciously, it can give your application a performance boost.

Let’s consider our (100, 1000)-shape array from earlier; suppose it contained 100 time traces, each 1000 elements long:

>>> dset = f2['big']

>>> dset.shape

(100, 1000)

Now suppose we want to copy the trace at dset[0,:] and overwrite all the others in the file. We might do this with a for loop:

>>> data = dset[0,:]
>>> for idx in xrange(100):
...     dset[idx,:] = data

This will work, but it does require us to write the loop, get the boundary conditions right, and of course perform 100 slicing operations.

There’s an even easier way, which exploits the built-in efficient broadcasting of h5py:

>>> dset[:,:] = dset[0,:]

The shape of the array on the right-hand side is (1000,); the shape of the selection on the left-hand side is (100, 1000). Since the last dimensions match, h5py broadcasts the row across all 100 rows of the selection. It’s as efficient as you can get; there’s only one slicing operation, and the remainder of the time is spent writing data to disk.
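
Broadcasting also works with scalars, which is handy for initializing or resetting regions of a dataset. A quick sketch on the same dataset:

>>> dset[:, 0:100] = 0.0    # one scalar broadcast across a (100, 100) selection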

Reading Directly into an Existing Array

Finally we come full circle back to read_direct, one of the most powerful methods available on the Dataset object. It’s as close as you can get to the “traditional” C interface of HDF5, without getting into the internal details of h5py.

To recap, you can use read_direct to have HDF5 “fill in” data into an existing array, automatically performing type conversion. Last time we saw how to read float32 data into a float64 NumPy array:

>>> dset.dtype

dtype('float32')

>>> out = np.empty((100, 1000), dtype=np.float64)

>>> dset.read_direct(out)

This works, but requires you to read the entire dataset in one go. Let’s pick a more useful example. Suppose we wanted to read the first time trace, at dset[0,:], and deposit it into the out array at out[50,:]. We can use the source_sel and dest_sel keywords, for the source selection and destination selection, respectively:

>>> dset.read_direct(out, source_sel=np.s_[0,:], dest_sel=np.s_[50,:])

The odd-looking np.s_ is a gadget that takes slices, in the ordinary array-slicing syntax, and returns a NumPy slice object with the corresponding information.
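
You can inspect what np.s_ produces; it’s nothing more than the usual index tuple you’d get from ordinary slicing syntax:

>>> np.s_[0, :]

(0, slice(None, None, None))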

By the way, you don’t have to match the shape of your output array to the dataset. Suppose our application wanted to compute the mean of the first 50 data points in each time trace, a common scenario when estimating DC offsets in real-world experimental data. You could do this using the standard slicing techniques:

>>> out = dset[:,0:50]

>>> out.shape

(100, 50)

>>> means = out.mean(axis=1)

>>> means.shape

(100,)

Using read_direct this would look like:

>>> out = np.empty((100,50), dtype=np.float32)

>>> dset.read_direct(out, np.s_[:,0:50]) # dest_sel can be omitted

>>> means = out.mean(axis=1)

This may seem like a trivial case, but there’s an important difference between the two approaches. In the first example, the out array is created internally by h5py, used to store the slice, and then thrown away. In the second example, out is allocated by the user, and can be reused for future calls to read_direct.

There’s no real performance difference when using (100, 50)-shape arrays, but what about (10000, 10000)-shape arrays?

Let’s check the real-world performance of this. We’ll create a test dataset and two functions. To keep things simple and isolate just the performance difference related to the status of out, we’ll always read the same selection of the dataset:

dset = f.create_dataset('perftest', (10000, 10000), dtype=np.float32)
dset[:] = np.random.random(10000)  # note the use of broadcasting!

def time_simple():
    dset[:,0:500].mean(axis=1)

out = np.empty((10000, 500), dtype=np.float32)

def time_direct():
    dset.read_direct(out, np.s_[:,0:500])
    out.mean(axis=1)

Now we’ll see what effect preserving the out array has, if we were to put the read in a for loop with 100 iterations:

>>> from timeit import timeit

>>> timeit(time_simple, number=100)

14.04414987564087

>>> timeit(time_direct, number=100)

12.045655965805054

Not too bad. The difference is 2 seconds, or about a 14% improvement. Of course, as with all optimizations, it’s up to you how far you want to go. This “simple” approach is certainly more legible. But when performing multiple reads of data with the same shape, particularly with larger arrays, it’s hard to beat read_direct.

NOTE

For historical reasons, there also exists a write_direct method. It does the same in reverse; however, in modern versions of h5py it’s no more efficient than regular slicing assignment. You’re welcome to use it if you want, but there’s no performance advantage.

A Note on Data Types

HDF5 is designed to preserve data in any format you want. Occasionally, this means you may get a file whose contents differ from the most convenient format for processing on your system. One example we discussed before is endianness, which relates to how multibyte numbers are represented. You can store a 4-byte floating-point number, for example, in memory with the least significant byte first (little-endian), or with the most significant byte first (big-endian). Modern Intel-style x86 chips use the little-endian format, but data can be stored in HDF5 in either fashion.

Because h5py doesn’t know whether you intend to process the data you retrieve or ship it off somewhere else, by default data is returned from the file in whatever format it’s stored. In the case of “endianness,” this is mostly transparent because NumPy supports both flavors. However, there are performance implications. Let’s create two NumPy arrays, one little-endian and one big-endian, and see how they perform on an x86 system:

>>> a = np.ones((1000,1000), dtype='<f4') # Little-endian 4-byte float

>>> b = np.ones((1000,1000), dtype='>f4') # Big-endian 4-byte float

>>> timeit(a.mean, number=1000)

1.684128999710083

>>> timeit(b.mean, number=1000)

3.1886370182037354

That’s pretty bad, about a factor of 2. If you’re processing data you got from somebody else, and your application does lots of long-running calculations, it’s worth taking a few minutes to check the byte order of the data you’re reading.

To convert to “native” endianness in this example you basically have three choices: use read_direct with a natively formatted array you create yourself, use the astype context manager, or convert the array manually in place after you read it. For the latter, there’s a quick way to convert NumPy arrays in place without making a copy:

>>> c = b.view("float32")

>>> c[:] = b

>>> b = c

>>> timeit(b.mean, number=1000)

1.697857141494751
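
The astype route mentioned above looks like this; a sketch, assuming dset is a dataset stored as big-endian 4-byte floats and that you’re reading it on a little-endian machine:

>>> with dset.astype('<f4'):      # ask HDF5 for little-endian output

...     native = dset[...]

>>> native.dtype

dtype('float32')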

This is a general performance issue, not limited to endian considerations. For example, single- versus double-precision floats have performance implications, and even integers can be problematic if you end up using 16-bit integers with code that has values greater than 2^16. Keep track of your types, and where possible use the features HDF5 provides to do conversion for you.

Resizing Datasets

So far, we’ve established that datasets have a shape and type, which are set when they’re created. The type is fixed and can never be changed. However, the shape can be changed, within certain limits.

Let’s create a new four-element dataset to investigate:

>>> dset = f.create_dataset('fixed', (2,2))

Looking at the attributes of dset, we see in addition to the .shape attribute, there’s an odd one called maxshape:

>>> dset.shape

(2, 2)

>>> dset.maxshape

(2, 2)

There’s also a resize method on the Dataset object. Let’s see what happens if we try to shrink our dataset from (2,2) to (1,1):

>>> dset.resize((1,1))

TypeError: Only chunked datasets can be resized

Evidently something is missing. How can we make a dataset resizable?

Creating Resizable Datasets

When you create a dataset, in addition to setting its shape, you have the opportunity to make it resizable up to a certain maximum set of dimensions, called its maxshape on the h5py side.

Like shape, maxshape is specified when the dataset is created, but can’t be changed. As you saw earlier, if you don’t explicitly choose a maxshape, HDF5 will create a non-resizable dataset and set maxshape = shape. The dataset will also be created with what’s called contiguous storage, which prevents the use of resize. Chapter 4 has more information on contiguous versus chunked storage; for now, it’s a detail we can ignore.

Let’s try again, this time specifying a maxshape for the dataset:

>>> dset = f.create_dataset('resizable', (2,2), maxshape=(2,2))

>>> dset.shape

(2, 2)

>>> dset.maxshape

(2, 2)

>>> dset.resize((1,1))

>>> dset.shape

(1, 1)

Success! What happens if we change back?

>>> dset.resize((2,2))

>>> dset.shape

(2, 2)

>>> dset.resize((2,3))

ValueError: unable to set extend dataset (Dataset: Unable to initialize object)

As the name suggests, you can’t make the dataset bigger than maxshape. But this is annoying. What if you don’t know when you create the dataset how big it should be? Should you just provide a very large number in maxshape to get around this limitation?

Thankfully, that isn’t necessary. HDF5 has the concept of “unlimited” axes to deal with this situation. If an axis is declared as “unlimited,” you can make it as big as you want. Simply provide None for that axis in the maxshape tuple to turn this feature on:

>>> dset = f.create_dataset('unlimited', (2,2), maxshape=(2, None))

>>> dset.shape

(2, 2)

>>> dset.maxshape

(2, None)

>>> dset.resize((2,3))

>>> dset.shape

(2, 3)

>>> dset.resize((2, 2**30))

>>> dset.shape

(2, 1073741824)

You can mark as many axes as you want as unlimited.

Finally, no matter what you put in maxshape, you can’t change the total number of axes. This value, the rank of the dataset, is fixed and can never be changed:

>>> dset.resize((2,2,2))

TypeError: New shape length (3) must match dataset rank (2)

Data Shuffling with resize

NumPy has a set of rules that apply when you change the shape of an array. For example, take a simple four-element square array with shape (2, 2):

>>> a = np.array([ [1, 2], [3, 4] ])

>>> a.shape

(2, 2)

>>> print a

[[1 2]
 [3 4]]

If we now resize it to (1, 4), keeping the total number of elements unchanged, the values are still there but rearrange themselves:

>>> a.resize((1,4))

>>> print a

[[1 2 3 4]]

And finally if we resize it to (1, 10), adding new elements, the new ones are initialized to zero:

>>> a.resize((1,10))

>>> print a

[[1 2 3 4 0 0 0 0 0 0]]

If you’ve reshaped NumPy arrays before, you’re likely used to this reshuffling behavior. HDF5 has a different approach. No reshuffling is ever performed. Let’s create a Dataset object to experiment on, which has both axes set to unlimited:

>>> dset = f.create_dataset('sizetest', (2,2), dtype=np.int32, maxshape=(None, None))

>>> dset[...] = [ [1, 2], [3, 4] ]

>>> dset[...]

array([[1, 2],
       [3, 4]])

We’ll try the same resizing as in the NumPy example:

>>> dset.resize((1,4))

>>> dset[...]

array([[1, 2, 0, 0]])

>>> dset.resize((1,10))

>>> dset[...]

array([[1, 2, 0, 0, 0, 0, 0, 0, 0, 0]])

What’s going on here? When we changed the shape from (2, 2) to (1, 4), the data at locations dset[1,0] and dset[1,1] didn’t get reshuffled; it was lost. For this reason, you should be very careful when using resize; the reshuffling tricks you’ve learned in the NumPy world will quickly lead to trouble.

Finally, you’ll notice that in this case the new elements are initialized to zero. In general, they will be set to the dataset’s fill value (see Fill Values).

When and How to Use resize

One of the most common questions about HDF5 is how to “append” to a dataset. With resize, this can be done if care is taken with respect to performance.

For example, let’s say we have another dataset storing 1000-element time traces. However, this time our application doesn’t know how many to store. It could be 10, or 100, or 1000. One approach might be this:

dset1 = f.create_dataset('timetraces', (1,1000), maxshape=(None, 1000))

def add_trace_1(arr):
    dset1.resize( (dset1.shape[0]+1, 1000) )
    dset1[-1,:] = arr

Here, every time a new 1000-element trace is added, the dataset is expanded by a single row. But that means one resize call per insertion, which doesn’t scale well, particularly if traces will be added thousands of times.

An alternative approach might be to keep track of the number of insertions and then “trim” the dataset when done:

dset2 = f.create_dataset('timetraces2', (5000, 1000), maxshape=(None, 1000))
ntraces = 0

def add_trace_2(arr):
    global ntraces
    dset2[ntraces,:] = arr
    ntraces += 1

def done():
    dset2.resize((ntraces,1000))
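
One common middle ground, sketched here as an illustration rather than a recipe from the book, is to grow the dataset in blocks so that resize is called only occasionally, then trim once at the end:

BLOCK = 1024
dset3 = f.create_dataset('timetraces3', (BLOCK, 1000), maxshape=(None, 1000))
ntraces3 = 0

def add_trace_3(arr):
    global ntraces3
    if ntraces3 == dset3.shape[0]:                    # out of room: grow by a block
        dset3.resize((dset3.shape[0] + BLOCK, 1000))
    dset3[ntraces3, :] = arr
    ntraces3 += 1

def done_3():
    dset3.resize((ntraces3, 1000))                    # trim unused rows at the end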

In the real world, it takes a little more work than this to get the best performance. We’ve gone about as far as we can go without discussing how the data is actually stored by HDF5. It’s time to talk about storage, and more precisely, chunks, in Chapter 4.