Organizing Data with References, Types, and Dimension Scales - Python and HDF5 (2013)

Chapter 8. Organizing Data with References, Types, and Dimension Scales

Your files aren’t just a collection of groups, datasets, and attributes. Some of the best features in HDF5 are those that help you to express relationships between pieces of your data.

Maybe one of your datasets provides the x-axis for another, and you’d like to express that in a way your colleagues can easily figure out. Maybe you want to record which regions of a particular dataset are of interest for further processing. Or maybe you just want to store a bunch of links to datasets and groups in the file, without having to worry about getting all the paths right.

This chapter covers three of the most useful constructs in HDF5 for linking your various objects together into a scientifically useful whole. References, the HDF5 “pointer” type, are a great way to store links to objects as data. Named types let you enforce type consistency across datasets. And Dimension Scales, an HDF5 standard, let you attach physically meaningful axes to your data in a way third-party programs can understand.

Let’s get started with the simplest relational widget in HDF5: object references.

Object References

We’ve already seen how links in a group serve to locate objects. But there’s another mechanism that can do this, and crucially, this kind can be stored as data in things like attributes and datasets.

Creating and Resolving References

Let’s create a simple file with a couple of groups and a dataset:

>>> f = h5py.File('refs_demo.hdf5','w')

>>> grp1 = f.create_group('group1')

>>> grp2 = f.create_group('group2')

>>> dset = f.create_dataset('mydata', shape=(100,))

Looking at the group grp1, we notice an interesting property called ref:

>>> grp1.ref

<HDF5 object reference>

The object returned from accessing .ref is an HDF5 object reference. These are basically pointers to objects in the file. You can “dereference” them by using the same syntax as we used for string names:

>>> out = f[grp1.ref]

>>> out == grp1

True

By the way, the Python type for these objects is available at h5py.Reference, in case you want to use isinstance:

>>> isinstance(grp1.ref, h5py.Reference)

True

Since the reference is an “absolute” way of locating an object, you can use any group in the file for dereferencing, not just the root group:

>>> out = grp2[grp1.ref]

>>> out == grp1

True

But keep in mind they’re local to the file. Trying to dereference them in the context of another file will fail:

>>> with h5py.File('anotherfile.hdf5','w') as f2:

... out = f2[grp1.ref]

ValueError: unable to dereference object

References as “Unbreakable” Links

So far there seems to be no improvement over using links. But there’s an important difference: you can store them as data, and they’re independent of later renaming of the objects involved.

Here’s an example: suppose we wanted to add an attribute on one of our groups pointing to the dataset mydata. We could simply record the name as an attribute:

>>> grp1.attrs['dataset'] = "mydata"

>>> grp1.attrs['dataset']

'mydata'
>>> out = f[grp1.attrs['dataset']]

>>> out == dset

True

But if we rename the dataset, this quickly breaks:

>>> f.move('mydata', 'mydata2')

>>> out = f[grp1.attrs['dataset']]

KeyError: "unable to open object"

Using object references instead, we have:

>>> grp1.attrs['dataset'] = dset.ref

>>> grp1.attrs['dataset']

<HDF5 object reference>

>>> out = f[grp1.attrs['dataset']]

>>> out == dset

True

Moving the dataset yet again, the reference still resolves:

>>> f.move('mydata2','mydata3')

>>> out = f[grp1.attrs['dataset']]

>>> out == dset

True


When you open an object by dereferencing, every now and then it’s possible that HDF5 won’t be able to figure out the object’s name. In that case, the object’s .name property will return None. It’s less of a problem than it used to be (HDF5 1.8 has gotten very good at figuring out names), but don’t be alarmed if you happen to get None.

References as Data

References are full-fledged types in HDF5; you can freely use them in both attributes and datasets. Obviously there’s no native type in NumPy for references, so we once again call on special_dtype for help, this time with the ref keyword:

>>> dt = h5py.special_dtype(ref=h5py.Reference)

>>> dt

dtype(('|O4', [(({'type': <type 'h5py.h5r.Reference'>}, 'ref'), '|O4')]))

That’s a lot of metadata. But don’t be dismayed; just like variable-length strings, this is otherwise a regular object dtype:

>>> dt.kind

'O'

We can easily create datasets of Reference type:

>>> ref_dset = f.create_dataset("references", (10,), dtype=dt)

What’s in such a dataset? If we retrieve an uninitialized element, we get a zeroed or “null” reference:

>>> out = ref_dset[0]

>>> out

<HDF5 object reference (null)>

Like a null pointer in C, this reference doesn’t point to anything. Like trying to dereference against the wrong file, dereferencing a null reference just results in ValueError:

>>> f[out]

ValueError: Invalid HDF5 object reference

There’s a simple way to check for a null reference, without having to catch exceptions. The truth value of a reference indicates whether or not it’s null:

>>> bool(out)

False

>>> bool(grp1.ref)

True


Keep in mind that a value of True doesn’t mean that the reference actually resolves to something, just that it isn’t null. If, for example, you create a reference and then delete the object, the reference will evaluate as True but you will still get ValueError when you attempt to dereference it.
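As a quick sketch of how the truth test is useful in practice, the following example fills only one slot of a reference dataset and scans for the non-null entries. It uses a throwaway in-memory file (the "core" driver with backing_store=False, so nothing touches disk), and the file, group, and dataset names are invented for the example:

```python
import h5py

# In-memory file via the "core" driver; nothing is written to disk
f = h5py.File('null_demo.hdf5', 'w', driver='core', backing_store=False)

grp = f.create_group('results')
dt = h5py.special_dtype(ref=h5py.Reference)
catalog = f.create_dataset('catalog', (5,), dtype=dt)

catalog[0] = grp.ref  # only slot 0 points anywhere; the rest stay null

# The truth test lets us skip null slots without catching ValueError
live = [i for i in range(5) if catalog[i]]
print(live)           # only slot 0 survives
```

This kind of scan is handy when a reference dataset is filled in incrementally and you need to know which entries are actually populated.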

Region References

Region references are one of the coolest features of HDF5. These let you store a reference to part of a dataset. For example, you might want to store a region of interest (ROI) on photographs stored in an HDF5 file, so that during later analysis you don’t have to process the whole thing.

Creating Region References and Reading

You can think of region references as effectively storing your slicing arguments for dataset access. Here’s an example: looking at the dataset created in the previous example, we notice a property named regionref:



>>> dset.shape

(100,)

>>> dset.regionref

<h5py._hl.dataset._RegionProxy at 0x459d6d0>

This is a little proxy object which we can use to store our selections. You create a new region reference by applying the standard NumPy slicing syntax to this object:

>>> ref_out = dset.regionref[10:90]

>>> ref_out

<HDF5 region reference>

Like object references, region references are generally opaque. The only useful aspects are the shape of the dataspace (the same as the parent dataset), and the shape of your selection:

>>> dset.regionref.shape(ref_out)

(100,)

>>> dset.regionref.selection(ref_out)

(80,)

This represents the shape of your selection; in other words, if you had applied your slicing arguments directly to the dataset, it’s the shape of the array that would have been returned from HDF5.

Once you’ve got a region reference, you can use it directly as a slicing argument to retrieve data from the dataset:

>>> data = dset[ref_out]

>>> data

array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

0., 0.], dtype=float32)

>>> data.shape

(80,)

Fancy Indexing

Keep in mind that if you’re using “fancy” indexing methods (like Boolean arrays), then the shape will always be one-dimensional. This mimics the behavior of NumPy for such selections.

For example, suppose we had a little two-dimensional array, which we populated with some random numbers:

>>> dset_random = f.create_dataset('small_example', (3,3))

>>> dset_random[...] = np.random.random((3,3))

>>> dset_random[...]

array([[ 0.32391435, 0.070962 , 0.57038087],

[ 0.1530778 , 0.22476801, 0.7758832 ],

[ 0.75768745, 0.73156554, 0.3228527 ]], dtype=float32)

We could create a Boolean array representing the entries greater than 0.5:

>>> index_arr = dset_random[...] > 0.5

>>> index_arr

array([[False, False, True],

[False, False, True],

[ True, True, False]], dtype=bool)

You can create a region reference from this array:

>>> random_ref = dset_random.regionref[index_arr]

>>> dset_random.regionref.selection(random_ref)

(4,)

There were a total of four elements that matched, so the selection result is “packed” into a four-element 1D buffer.

There is a rule to the order in which such elements are retrieved. If we apply our selection to the dataset:

>>> data = dset_random[random_ref]

>>> data

array([ 0.57038087, 0.7758832 , 0.75768745, 0.73156554], dtype=float32)

Looking closely, it appears that the elements retrieved are at [0,2], [1,2], [2,0], [2,1], in that order. You’ll recognize this as “C” order; the selection advances through the last index, then the next to last, and so on.


Unfortunately, list-based selections will also be returned as 1D arrays, unlike NumPy (try it!). This is a limitation of the HDF5 library.

Finding Datasets with Region References

There’s one more trick region references can do. If you have a region reference, say our shape-(80,) selection ref_out from earlier, you can use it as an object reference to retrieve the dataset:

>>> f[ref_out]

<HDF5 dataset "mydata3": shape (100,), type "<f4">

This can come in handy when you’ve stored a region reference as an attribute somewhere. It means you don’t have to also store an object reference to figure out where to apply the selection.

And if you’re just after the data and don’t care about the dataset itself, all you have to do is:

>>> selected_data = f[ref_out][ref_out]

Named Types

There’s one more “linking” concept in HDF5, and it’s much more subtle than either object or region references. We’ve already seen that when you create a dataset (or an attribute), it’s created with a fixed data type. Suppose you have multiple data products in a file (for example, many datasets containing image data), and you want to be sure each has exactly the same type.

HDF5 provides a native way to ensure this, by allowing you to save a data type to the file independently of any particular dataset or attribute. When you call create_dataset, you supply the stored type and HDF5 will “link” the type to the brand new dataset.

The Datatype Object

You can create such an independent, or “named” type, by simply assigning a NumPy dtype to a name in the file:

>>> f['mytype'] = np.dtype('float32')

When we open the named type, we don’t get a dtype back, but something else:

>>> out = f['mytype']

>>> out

<HDF5 named type "mytype" (dtype <f4)>

Like the Dataset object, this h5py.Datatype object is a thin proxy that allows access to the underlying HDF5 datatype. The most immediately obvious property is Datatype.dtype, which returns the equivalent NumPy dtype object:

>>> out.dtype

dtype('<f4')

Since they’re full-fledged objects in the file, you have a lot of other properties as well:



>>> out.parent

<HDF5 group "/" (6 members)>

Also available are .file (h5py.File instance containing the type), .ref (object reference to the type), and attributes, just like Dataset and Group objects:

>>> out.attrs['info'] = "This is an attribute on a named type object"


In the HDF5 world, for technical reasons named types are now called committed types. You may hear both terms; for our purposes, they mean the same thing.

Linking to Named Types

It’s simple to create a dataset or attribute that refers to a named type object; just supply the Datatype instance as the dtype:

>>> dset = f.create_dataset("typedemo", (100,), dtype=f['mytype'])

When you do this, HDF5 doesn’t copy the type into the dataset; it actually makes a reference to the named type object elsewhere in the file. So on top of helping you keep data organized, it saves disk space as well.
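To make the consistency guarantee concrete, here is a minimal sketch (throwaway in-memory file, invented names) that commits one type and links two datasets to it; both datasets end up with exactly the same on-disk type:

```python
import h5py
import numpy as np

f = h5py.File('types_demo.hdf5', 'w', driver='core', backing_store=False)

f['pixel_t'] = np.dtype('float32')   # commit the type to the file

# Both datasets link to the same committed type object
img_a = f.create_dataset('img_a', (4, 4), dtype=f['pixel_t'])
img_b = f.create_dataset('img_b', (4, 4), dtype=f['pixel_t'])

print(img_a.dtype == img_b.dtype == np.dtype('float32'))
```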

For attributes, remember you’ll have to explicitly supply the type via the create method. For example, to create an attribute on the root group that uses our named type:

>>> f.attrs.create("attribute_demo", 1.0, dtype=f['mytype'])

Managing Named Types

You can’t modify named types once they’re created, although you can unlink them from the file:

>>> del f['mytype']

But don’t try to trick HDF5 by replacing it with a new, different type. The stored type information can’t be unlinked from the datasets and attributes that point to it. It’ll hang around in the shadows until every dataset and attribute that used it is deleted from the file:

>>> f['mytype'] = np.dtype('int16')

>>> dset = f['typedemo']

>>> dset.dtype

dtype('<f4')

Dimension Scales

Real-world data comes with units attached. Suppose we have a 3D dataset that is the output from an atmospheric simulation, representing temperatures at various points within a volume:

>>> dset = f.create_dataset('temperatures', (100,100,100), dtype='f')

It’s easy enough to recognize that this dataset records “temperature”; we could even add an attribute to record the units:

>>> dset.attrs['temp_units'] = "C"

But there’s a more subtle problem. Suppose our simulation is focused on a convection process, and has much greater resolution in the z direction than either the x or y directions; for example, the steps in the vertical direction might be 100 meters while the steps in the horizontal direction might be 10 kilometers. We could add more attributes, perhaps by having a “step” attribute like so:

>>> dset.attrs['steps'] = [10000,10000,100]

By the way, which axis is which? Does the first axis represent the x direction, as we might expect? Or does the simulation output z first? We could add another attribute, I guess, which records this:

>>> dset.attrs['axes'] = ["x", "y", "z"]

We would have to tell all of our colleagues about this convention, of course, and it would look “squished” in viewers that don’t know about our ad hoc naming convention, and I suppose if we change to variable-sized steps in the z direction it would break…

It turns out this situation is common enough that a standard has arisen: the HDF5 Dimension Scales specification. Like the ad hoc system of attributes earlier, it’s a feature built on top of HDF5, using the machinery of datasets, attributes, and references in a standardized way to build a more expressive object.

There are a lot of features like this, all standardized by the HDF Group (authors and maintainers of the HDF5 software suite) in a series of RFCs. The big advantage of this approach is that third-party applications know what to do with certain specific combinations of groups, datasets, attributes, and references. For example, a viewer program would render our simulation “unsquished” by honoring the axes we create. Or, in the case of the HDF5 Image standard, the viewer could determine the appropriate palette to render an astronomical photograph.

Creating Dimension Scales

Let’s revisit our dataset with 3D temperature measurements. First, we’ll erase the attributes we applied, before using Dimension Scales to do it right:

>>> for name in dset.attrs:

... del dset.attrs[name]

There’s a property attached to our dataset, which up until now we’ve ignored:

>>> dset.dims

<Dimensions of HDF5 object at 74374048>

This is our entry point for Dimension Scales. In HDF5, a “dimension scale” is a separate “axis” dataset with some metadata, linked to the main dataset using references. In our example, we want to create three dimension scales: one for x, with steps of 10 km, one for y, also with steps of 10 km, and one for z, in steps of 100 m.

To record this, we first create three datasets in the file which will hold our Dimension Scale “axes”:

>>> f.create_dataset('scale_x', data=np.arange(100)*10e3)

>>> f.create_dataset('scale_y', data=np.arange(100)*10e3)

>>> f.create_dataset('scale_z', data=np.arange(100)*100.0)

Now, we ask HDF5 to turn them into official “Dimension Scale” datasets by using the create_scale method on dset.dims:

>>> dset.dims.create_scale(f['scale_x'], "Simulation X (North) axis")

>>> dset.dims.create_scale(f['scale_y'], "Simulation Y (East) axis")

>>> dset.dims.create_scale(f['scale_z'], "Simulation Z (Vertical) axis")

It’s worth taking a moment to see what actually happens when we do this. Let’s inspect the attributes of the scale_x dataset and see what’s there:

>>> for key, val in f['scale_x'].attrs.iteritems():

... print key, ':', val


CLASS : DIMENSION_SCALE

NAME : Simulation X (North) axis

That’s really it. All create_scale did was attach a few attributes with standardized names and values.

Attaching Scales to a Dataset

Now that we have our three scales, we can associate them with our dataset. Note, however, that we have to associate each scale with a particular axis of the dataset. This is expressed by using indexing on the Dataset.dims object:

>>> dset.dims[0].attach_scale(f['scale_x'])

>>> dset.dims[1].attach_scale(f['scale_y'])

>>> dset.dims[2].attach_scale(f['scale_z'])

The object at dims[N] is yet another little proxy, in this case keeping track of which dimension scales are attached to the Nth axis of the dataset. Yes, you can have multiple scales attached to a single axis! Good news for those of you who create plots with an axis on every side.

The dims[N] proxy works like an ordered dictionary and supports access by both name and index. In this case, the index refers to the order in which scales were added. For example, to get the dataset containing our x scale, we could ask for the first scale associated with dimension 0 of the dataset:

>>> dset.dims[0][0]

<HDF5 dataset "scale_x": shape (100,), type "<f8">

And to get the actual axis values, simply slice into the dataset:

>>> dset.dims[0][0][...]

array([ 0., 10000., 20000., 30000., 40000., 50000.,

60000., 70000., 80000., 90000., 100000., 110000.,

120000., 130000., 140000., 150000., 160000., 170000.,

180000., 190000., 200000., 210000., 220000., 230000.,

240000., 250000., 260000., 270000., 280000., 290000.,

300000., 310000., 320000., 330000., 340000., 350000.,

360000., 370000., 380000., 390000., 400000., 410000.,

420000., 430000., 440000., 450000., 460000., 470000.,

480000., 490000., 500000., 510000., 520000., 530000.,

540000., 550000., 560000., 570000., 580000., 590000.,

600000., 610000., 620000., 630000., 640000., 650000.,

660000., 670000., 680000., 690000., 700000., 710000.,

720000., 730000., 740000., 750000., 760000., 770000.,

780000., 790000., 800000., 810000., 820000., 830000.,

840000., 850000., 860000., 870000., 880000., 890000.,

900000., 910000., 920000., 930000., 940000., 950000.,

960000., 970000., 980000., 990000.])

We could also use dictionary-style access using the name we supplied when creating the scale:

>>> dset.dims[0]["Simulation X (North) axis"]

<HDF5 dataset "scale_x": shape (100,), type "<f8">

There are a couple of other dictionary-like methods too, including items, keys, and values.
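Here is a compact sketch of that dictionary-style access (throwaway in-memory file, invented names). Note one hedge: it uses Dataset.make_scale, the spelling introduced in h5py 2.10; in the book-era API the equivalent call is dims.create_scale as shown earlier in this chapter:

```python
import h5py
import numpy as np

f = h5py.File('dims_demo.hdf5', 'w', driver='core', backing_store=False)
temps = f.create_dataset('temps', (4,))
x = f.create_dataset('scale_x', data=np.arange(4) * 10e3)

# h5py 2.10+ spelling; older versions: temps.dims.create_scale(x, "...")
x.make_scale("Simulation X (North) axis")
temps.dims[0].attach_scale(x)

# The dims[N] proxy supports dict-style iteration
for name, scale in temps.dims[0].items():
    print(name, scale.shape)
```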

Finally, you can label each axis of the dataset. This is the correct place to record which axis is x, y, z, etc.:

>>> dset.dims[0].label = "x"

>>> dset.dims[1].label = "y"

>>> dset.dims[2].label = "z"

Now that we’ve covered all the basic constructs, it’s time to talk about one of the thorniest issues when programming with HDF5: concurrency.