Groups, Links, and Iteration: The “H” in HDF5 - Python and HDF5 (2013)


Chapter 5. Groups, Links, and Iteration: The “H” in HDF5

So far we’ve seen how to create Dataset objects by giving them a name in the file like myfile["dataset1"] or myfile["dataset2"]. Unless you’re one of those people who stores all their documents on the desktop, you can probably see the flaw in this approach.

Groups are the HDF5 container object, analogous to folders in a filesystem. They can hold datasets and other groups, allowing you to build up a hierarchical structure with objects neatly organized in groups and subgroups.

The Root Group and Subgroups

You may have guessed by now that the File object is itself a group. In this case, it also serves as the root group, named /, our entry point into the file.

The more general group object is h5py.Group, of which h5py.File is a subclass. Other groups are easily created by the method create_group:

>>> f = h5py.File("Groups.hdf5")

>>> subgroup = f.create_group("SubGroup")

>>> subgroup

<HDF5 group "/SubGroup" (0 members)>

>>> subgroup.name

u'/SubGroup'

Of course, groups can be nested also. The create_group method exists on all Group objects, not just File:

>>> subsubgroup = subgroup.create_group("AnotherGroup")

>>> subsubgroup.name

u'/SubGroup/AnotherGroup'

By the way, you don’t have to manually create nested groups one at a time. If you supply a full path, HDF5 will create all the intermediate groups for you:

>>> out = f.create_group('/some/big/path')

>>> out

<HDF5 group "/some/big/path" (0 members)>

The same goes for creating datasets; just supply the full path you want and HDF5 will fill in the missing pieces.
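As a minimal sketch of the dataset case, assuming Python 3 and a recent h5py (the file name and the path /a/b/c/data are made up for illustration, and the in-memory "core" driver is used only so that nothing is written to disk):

```python
import h5py

# In-memory file (core driver, no backing store): nothing touches the disk.
with h5py.File("nested_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    # Only the full dataset path is given; /a, /a/b and /a/b/c do not exist yet.
    dset = f.create_dataset("/a/b/c/data", data=[1, 2, 3])
    dset_name = dset.name

    # The intermediate groups were created implicitly.
    intermediate = [f[p].name for p in ("a", "a/b", "a/b/c")]
    all_groups = all(isinstance(f[p], h5py.Group) for p in intermediate)

print(dset_name, intermediate, all_groups)
```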

Group Basics

If you remember nothing else from this chapter, remember this: groups work mostly like dictionaries. There are a couple of holes in this abstraction, but on the whole it works surprisingly well. Groups are iterable, and have a subset of the normal Python dictionary API.

Let’s add another few objects to our file for the examples that follow:

>>> f["Dataset1"] = 1.0

>>> f["Dataset2"] = 2.0

>>> f["Dataset3"] = 3.0

>>> subgroup["Dataset4"] = 4.0

Dictionary-Style Access

You got a hint of this dictionary-like behavior from the syntax group[name] = object. Objects can be retrieved from a group by name:

>>> dset1 = f["Dataset1"]

Unlike normal Python dictionaries, you can also use POSIX-style paths to directly access objects in subgroups, without having to tediously open all the groups between here and there:

>>> dset4 = f["SubGroup/Dataset4"] # Right

>>> dset4 = f["SubGroup"]["Dataset4"] # Works, but inefficient

Attempting to access a name that doesn’t exist in the group raises KeyError, although one irritating thing about h5py is that you don’t get the name of the missing object in the exception:

>>> f['BadName']

KeyError: "unable to open object (Symbol table: Can't open object)"

There’s also the familiar get method, which is handy if you don’t want to raise an exception:

>>> out = f.get("BadName")

>>> print out

None

You can take the length of a group—note that this measures the number of objects directly attached to the group rather than all the objects in nested subgroups:

>>> len(f)

5

>>> len(f["SubGroup"])

2

Pythonic iteration is also supported using the familiar iteritems() and friends (see Iteration and Containership).
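The dictionary-style operations above can be pulled together in one short sketch. It assumes Python 3 and a recent h5py; the file name is illustrative and the in-memory "core" driver keeps the example off the disk:

```python
import h5py

with h5py.File("dict_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    f["Dataset1"] = 1.0
    sub = f.create_group("SubGroup")
    sub["Dataset4"] = 4.0

    # POSIX-style path straight to a member of a subgroup; [()] reads the scalar.
    value = float(f["SubGroup/Dataset4"][()])

    # get() returns a default (None unless specified) instead of raising.
    missing = f.get("BadName")
    fallback = f.get("BadName", "no such object")

    # len() counts only the direct members of the group.
    top_level_count = len(f)

    # Item access on a missing name raises KeyError.
    try:
        f["BadName"]
        raised = False
    except KeyError:
        raised = True

print(value, missing, fallback, top_level_count, raised)
```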

Special Properties

There are a few widgets attached to groups (and datasets) that are very useful when working with the hierarchy in a file.

First is the .file property. Attached to every object, this gives you a handy way to retrieve a File object for the file in which your object resides:

>>> f = h5py.File('propdemo.hdf5','w')

>>> grp = f.create_group('hello')

>>> grp.file == f

True

This is great when you want to check whether a file is read/write, or just get the filename.

Second is the .parent property. This returns the Group object that contains your object:

>>> grp.parent

<HDF5 group "/" (1 members)>

With these two properties, you can avoid most of the path-formatting headaches associated with filesystem work.
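Here is a small sketch of both properties used on a dataset rather than a group, assuming Python 3 and a recent h5py (the names hello and world are illustrative, and the file lives only in memory):

```python
import h5py

with h5py.File("prop_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("hello/world", data=[1, 2, 3])

    # .file works from any object, however deep it sits in the hierarchy.
    same_file = dset.file == f
    writable = dset.file.mode == "r+"   # a file created with 'w' reports 'r+'

    # .parent walks upward one level at a time, following .name.
    parent_name = dset.parent.name
    grandparent_name = dset.parent.parent.name

print(same_file, writable, parent_name, grandparent_name)
```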

Working with Links

What does it mean to give an object a name in the file? From the preceding examples, you might think that the name is part of the object, in the same way that the dtype or shape are part of a dataset.

But this isn’t the case. There’s a layer between the group object and the objects that are its members. The two are related by the concept of links.

Hard Links

Links in HDF5 are handled in much the same way as in modern filesystems. Objects like datasets and groups don’t have an intrinsic name; rather, they have an address (byte offset) in the file that HDF5 has to look up. When you assign an object to a name in a group, that address is recorded in the group and associated with the name you provided to form a link.

Among other things, this means that objects in an HDF5 file can have more than one name; in fact, they have as many names as there exist links pointing to them. The number of links that point to an object is recorded, and when no more links exist, the space used for the object is freed.

This kind of a link, the default in HDF5, is called a hard link to differentiate it from other kinds of links discussed later in this chapter.

Here’s an example of the multiple-name behavior. We’ll create a simple file containing a group, and create a hard link to it at /x:

>>> f = h5py.File('linksdemo.hdf5','w')

>>> grpx = f.create_group('x')

>>> grpx.name

u'/x'

Now we’ll create a second link pointing to the group. You can do this using standard Python dictionary-style item assignment:

>>> f['y'] = grpx

When we retrieve an object from location /y, we get the same group back:

>>> grpy = f['y']

>>> grpy == grpx

True

You might wonder what happens to the .name property if objects in a file don’t have a unique name. Let’s see:

>>> grpx.name

u'/x'

>>> grpy.name

u'/y'

HDF5 makes a best effort to return the name used to retrieve an object. But there’s no guarantee. If the object has a name, then you’ll generally get one when accessing .name, but it may not be the one you expect.

What does that mean, “if the object has a name”? It’s perfectly legal in HDF5 to create an object without a name; just supply None:

>>> grpz = f.create_group(None)

>>> print grpz.name

None

The group grpz in this case exists in the file, but there’s no way to get there from the root group. If we were to get rid of the Python object grpz, the group would be deleted and the space in the file reclaimed. To avoid this we can simply link the group into the file structure at our leisure:

>>> f['z'] = grpz

>>> grpz.name

u'/z'

The multiple names issue also affects the behavior of the .parent property. To address this, in h5py, obj.parent is defined to be the “parent” object according to obj.name. For example, if obj.name is /foo/bar, obj.parent.name will be /foo.

One way to express this is with the posixpath module built into Python:

>>> import posixpath

>>> parent = obj.file[posixpath.dirname(obj.name)]

To remove links, we use the dictionary-style syntax del group[name]:

>>> del f['y']

Once all hard links to an object are gone (and the object isn’t open somewhere in Python), it’s destroyed:

>>> del f['x'] # Last hard link; the group is deleted in the file
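The reference-counting behavior can be checked with a dataset as well. The following is a sketch under the assumption of Python 3 and a recent h5py; the names original and alias are made up for the example:

```python
import h5py

with h5py.File("hardlink_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("original", data=[1, 2, 3])

    # Second hard link: same object, second name.
    f["alias"] = dset
    same_object = f["alias"] == f["original"]

    # Removing one link leaves the object alive, since another link remains.
    del f["original"]
    remaining_names = list(f.keys())
    data_via_alias = f["alias"][:].tolist()

print(same_object, remaining_names, data_via_alias)
```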

Free Space and Repacking

When an object (for example, a large dataset) is deleted, the space it occupied on disk is reused for new objects like groups and datasets. However, at the time of writing, HDF5 does not track such “free space” across file open/close cycles. So if you don’t end up reusing the space by the time you close the file, you may end up with a “hole” of unusable space in the file that can’t be reclaimed.

This issue is a high development priority for the HDF Group. In the meantime, if your files seem unusually large you can “repack” them with the h5repack tool, which ships with HDF5:

$ h5repack bigfile.hdf5 out.hdf5

Soft Links

Those of you who have used Linux or Mac OS X will be familiar with “soft” links. Unlike “hard” links, which associate a link name with a particular object in the file, soft links instead store the path to an object.

Here’s an example. Let’s create a file and populate it with a single group containing a dataset:

>>> f = h5py.File('test.hdf5','w')

>>> grp = f.create_group('mygroup')

>>> dset = grp.create_dataset('dataset', (100,))

If we were to create a hard link in the root group to the dataset, it would always point to that particular object, even if the dataset were moved or unlinked from mygroup:

>>> f['hardlink'] = dset

>>> f['hardlink'] == grp['dataset']

True

>>> grp.move('dataset', 'new_dataset_name')

>>> f['hardlink'] == grp['new_dataset_name']

True

Let’s move the dataset back, and then create a soft link that points to the path /mygroup/dataset. To tell HDF5 that we want to create a soft link, assign an instance of the class h5py.SoftLink to a name in the file:

>>> grp.move('new_dataset_name', 'dataset')

>>> f['softlink'] = h5py.SoftLink('/mygroup/dataset')

>>> f['softlink'] == grp['dataset']

True

SoftLink objects are very simple; they only have one property, .path, holding the path provided when they are created:

>>> softlink = h5py.SoftLink('/some/path')

>>> softlink

<SoftLink to "/some/path">

>>> softlink.path

'/some/path'

Keep in mind that instances of h5py.SoftLink are purely a Python-side convenience, not a wrapper around anything in HDF5. Nothing happens until you assign one of them to a name in the file.

Returning to our example, since only the path is stored, if we move the dataset and replace it with something else, /softlink would then point to the new object:

>>> grp.move('dataset', 'new_dataset_name')

>>> dset2 = grp.create_dataset('dataset', (50,))

>>> f['softlink'] == dset

False

>>> f['softlink'] == dset2

True

Soft links are therefore a great way to refer to “the object which resides at /some/particular/path,” rather than any specific object in the file. This can be very handy if, for example, a particular dataset represents some information that needs to be updated without breaking all the links to it elsewhere in the file.

The value of a soft link is not checked when it’s created. If you supply an invalid path (or the object is later moved or deleted), accessing the link will fail with an exception. Because of how HDF5 reports such an error, you will get the same exception as you would when trying to access a nonexistent name in the group, KeyError:

>>> f['broken'] = h5py.SoftLink('/some/nonexistent/object')

>>> f['broken']

KeyError: "unable to open object"

By the way, since soft links only record a path, they don’t participate in the reference counting that hard links do. So if a soft link /softlink points at an object whose only hard link is /a, deleting that hard link (del f["/a"]) destroys the object, and the soft link simply breaks.

NOTE

You may be wondering what happens when a broken soft link appears in items() or values(). The answer is that object None is used as the value instead.
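The soft-link behavior described in this section can be condensed into one sketch. It assumes Python 3 and a recent h5py, and it uses the dataset shapes (rather than object equality) to show which object the link currently resolves to:

```python
import h5py

with h5py.File("softlink_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("mygroup")
    grp.create_dataset("dataset", shape=(100,))

    f["softlink"] = h5py.SoftLink("/mygroup/dataset")
    shape_before = f["softlink"].shape

    # Replace whatever lives at /mygroup/dataset; the soft link follows the path.
    grp.move("dataset", "old_dataset")
    grp.create_dataset("dataset", shape=(50,))
    shape_after = f["softlink"].shape

    # The stored path can be read back without resolving the link.
    stored_path = f.get("softlink", getlink=True).path

    # A dangling soft link is accepted at creation and fails only on access.
    f["broken"] = h5py.SoftLink("/some/nonexistent/object")
    try:
        f["broken"]
        broken_raises = False
    except KeyError:
        broken_raises = True

print(shape_before, shape_after, stored_path, broken_raises)
```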

External Links

Starting with HDF5 1.8, there’s another type of link besides the file-local hard and soft links. External links allow you to refer to objects in other files. They’re one of the coolest features of HDF5, but one of the most troublesome to keep track of because of their transparency.

An external link has two components: the name of a file, and then the (absolute) name of an object within that file. Like soft links, you create them with a “marker” object, in this case an instance of h5py.ExternalLink.

Let’s create a file with a single object inside, and then link to it from another file:

>>> with h5py.File('file_with_resource.hdf5', 'w') as f1:

...     f1.create_group('mygroup')

>>> f2 = h5py.File('linking_file.hdf5', 'w')

>>> f2['linkname'] = h5py.ExternalLink('file_with_resource.hdf5', 'mygroup')

Like soft links, external links are transparent, in the sense that if they’re not broken, we get back the group or dataset they point to instead of some intermediate object. So if we access the link we just created, we get a group back:

>>> grp = f2['linkname']

>>> grp.name

u'/mygroup'

However, if we look more closely we discover that this object resides in a different file:

>>> grp.file

<HDF5 file "file_with_resource.hdf5" (mode r+)>

>>> f2

<HDF5 file "linking_file.hdf5" (mode r+)>

Keep in mind that this can lead to odd-looking consequences. For example, when you use the .parent property on the retrieved object, it points to the root group of the external file, not the file in which the link resides:

>>> f2['/linkname'].parent == f2['/']

False

Both the file and object names are checked when you create the external link. So if HDF5 can’t find the file, or the specified object within the file, you’ll get an exception:

>>> f2['anotherlink'] = h5py.ExternalLink('missing.hdf5','/')

ValueError: unable to create link (Links: Unable to initialize object)

The two main hazards when dealing with external links are (1) that the file the link points to won’t be around when it’s accessed, and (2) by simply traversing the links in the file, you can “wander” into a different file.

There’s not too much we can do about (1); it’s up to you to keep files organized and be mindful of what links to what. Hazard (2) is a little more dangerous, particularly since all the “Pythonic” methods of accessing group members, like iteration, items(), and so on, will include external links. If it’s undesirable for your application to cross file boundaries, be sure to check the .file property to see where the objects actually reside.

At the moment, there’s no way to set a “search path” from h5py. When an external link is encountered, HDF5 will first look for the destination file in the same directory as the file with the link, and then in the current working directory. That’s it.
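One way to guard against hazard (2) is to record, for each member of a group, which file it actually lives in. The sketch below assumes Python 3 and a recent h5py; it writes two small files into a temporary directory and, to sidestep the search rules entirely, stores an absolute path in the external link (an illustrative choice, not a requirement):

```python
import os
import tempfile

import h5py

with tempfile.TemporaryDirectory() as tmp:
    target_path = os.path.join(tmp, "file_with_resource.hdf5")
    linking_path = os.path.join(tmp, "linking_file.hdf5")

    with h5py.File(target_path, "w") as f1:
        f1.create_group("mygroup")

    with h5py.File(linking_path, "w") as f2:
        f2.create_group("local_group")
        f2["linkname"] = h5py.ExternalLink(target_path, "/mygroup")

        # Map each member name to the file its object really resides in.
        home = {name: os.path.basename(obj.file.filename)
                for name, obj in f2.items()}

        # Members that would take us across a file boundary.
        foreign = sorted(name for name, fname in home.items()
                         if fname != "linking_file.hdf5")

print(home, foreign)
```

Comparing file names is only one possible check; comparing obj.file against the File object you started from would serve the same purpose.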

A Note on Object Names

You might have noticed that when you retrieve the name of an object, it comes out as a Python Unicode object:

>>> f['/foo'].name

u'/foo'

This is intentional. The behavior was introduced in h5py version 2.0, following the improved Unicode support of HDF5 1.8 (and Python 3). Object names in the file are always treated as “text” strings, which means they represent sequences of characters. In contrast, “byte” strings are sequences of 8-bit numbers that can often but not always store ASCII or Latin-1 encoded text.

The great thing about this is that international characters are supported for all object names in the file; you don’t have to “ASCII-ize” anything to fit it into the HDF5 system. Names are stored using the most compatible strategy available, so that files remain readable by older versions (1.6) of HDF5.

To take advantage of this, simply supply a Unicode string when creating an object or making a new link:

>>> grp = f.create_group(u'e_with_accent_\u00E9')

>>> print grp.name

/e_with_accent_é

On the backend, h5py converts your string to the HDF5-approved UTF-8 encoding before storing it. When you supply a “regular” or “byte” string (as in most of the previous examples), h5py uses your string as is. It’s technically possible to store non-UTF-8 strings like this, although such use is strongly discouraged. If you do happen to receive a file with such “noncompliant” object names, h5py will simply pass back the raw byte string and let you figure it out.

Using get to Determine Object Types

We mentioned that the familiar dictionary-style method get was also available on Group objects, and showed how to handle missing group members without raising KeyError. But this version of get is a little more capable than the Python get.

There are two keywords in addition to the dictionary-style default: getclass and getlink. The getclass keyword lets you retrieve the type of an object without actually having to open it. At the HDF5 level, this only requires reading some metadata and is consequently very fast.

Here’s an example: first we’ll create a file containing a single group and a single dataset:

>>> f = h5py.File('get_demo.hdf5','w')

>>> f.create_group('subgroup')

>>> f.create_dataset('dataset', (100,))

Using get, the type of object can be retrieved:

>>> for name in f:

...     print name, f.get(name, getclass=True)

dataset <class 'h5py._hl.dataset.Dataset'>

subgroup <class 'h5py._hl.group.Group'>

The second keyword, getlink, lets you determine the properties of the link involved:

>>> f['softlink'] = h5py.SoftLink('/subgroup')

>>> with h5py.File('get_demo_ext.hdf5','w') as f2:

...     f2.create_group('egroup')

>>> f['extlink'] = h5py.ExternalLink('get_demo_ext.hdf5','/egroup')

>>> for name in f:

...     print name, f.get(name, getlink=True)

dataset <h5py._hl.group.HardLink object at 0x047277F0>

extlink <ExternalLink to "/egroup" in file "get_demo_ext.hdf5"

softlink <SoftLink to "/subgroup">

subgroup <h5py._hl.group.HardLink object at 0x047273B0>

You’ll notice that instances of SoftLink and ExternalLink were returned, complete with path information. This is the official way to retrieve such information after the link is created.

For the hard links at subgroup and dataset, there’s also an instance of something called h5py.HardLink. This exists solely to support the use of get; it has no other function and no properties or methods.

Finally, if all you care about is the kind of link involved, and not the exact values of the paths and files involved, you can combine the getclass and getlink keywords to return the link class:

>>> for name in f:

...     print name, f.get(name, getclass=True, getlink=True)

dataset <class 'h5py._hl.group.HardLink'>

extlink <class 'h5py._hl.group.ExternalLink'>

softlink <class 'h5py._hl.group.SoftLink'>

subgroup <class 'h5py._hl.group.HardLink'>

NOTE

For many of the classes involved here, you may notice that they were originally defined in the subpackage h5py._hl, for example h5py._hl.group.SoftLink shown earlier. This is an implementation detail that may change; when doing isinstance checks, etc., use the names directly attached to the h5py package (e.g., h5py.SoftLink).
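Following the advice in the note, here is a sketch that classifies members using only the names attached directly to the h5py package. It assumes Python 3 and a recent h5py, and it leaves external links out to keep the example self-contained:

```python
import h5py

with h5py.File("get_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    f.create_group("subgroup")
    f.create_dataset("dataset", shape=(100,))
    f["softlink"] = h5py.SoftLink("/subgroup")

    # Kind of link behind each name (no object is opened).
    link_kinds = {name: f.get(name, getclass=True, getlink=True) for name in f}

    # Kind of object at the end of each hard link.
    object_kinds = {name: f.get(name, getclass=True)
                    for name in ("subgroup", "dataset")}

    # Only the soft links, with their stored paths.
    soft_targets = {name: f.get(name, getlink=True).path
                    for name, kind in link_kinds.items()
                    if kind is h5py.SoftLink}

print(link_kinds, object_kinds, soft_targets)
```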

Using require to Simplify Your Application

Unlike Python dictionaries, you can’t directly overwrite the members of a group:

>>> f = h5py.File('require_demo.hdf5','w')

>>> f.create_group('x')

>>> f.create_group('y')

>>> f.create_group('y')

ValueError: unable to create group (Symbol table: Unable to initialize object)

This also holds true for manually hard-linking objects:

>>> f['y'] = f['x']

ValueError: unable to create link (Links: Unable to initialize object)

This is an intentional feature designed to prevent data loss. Since objects are immediately deleted when you unlink them from a group, you have to explicitly delete the link rather than having HDF5 do it for you:

>>> del f['y']

>>> f['y'] = f['x']

This leads to some headaches in real-world code. For example, a fragment of analysis code might create a file and write the results to a dataset:

>>> data = do_large_calculation()

>>> with h5py.File('output.hdf5') as f:

...     f.create_dataset('results', data=data)

If there are many datasets and groups in the file, it might not be appropriate to overwrite the entire file every time the code runs. But if we don’t open in w mode, then our program will only work the first time, unless we manually remove the output file every time it runs.

To deal with this, create_group and create_dataset have companion methods called require_group and require_dataset. They do exactly the same thing, only first they check for an existing group or dataset and return it instead.

Both versions take exactly the same arguments and keywords. In the case of require_dataset, h5py also checks the requested shape and dtype against any existing dataset and fails if they don’t match:

>>> f.create_dataset('dataset', (100,), dtype='i')

>>> f.require_dataset('dataset', (100,), dtype='f')

TypeError: Datatypes cannot be safely cast (existing int32 vs new f)

There’s a minor detail here, in that a conflict is only deemed to occur if the shapes don’t match, or the requested precision of the datatype is higher than the existing precision. So if there’s a preexisting int64 dataset, then require_dataset will succeed if int32 is requested:

>>> f.create_dataset('int_dataset', (100,), dtype='int64')

>>> f.require_dataset('int_dataset', (100,), dtype='int32')

The NumPy casting rules are used to check for conflicts; you can test the types yourself using np.can_cast.
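The sketch below, assuming Python 3 with recent h5py and NumPy, shows the require_* methods being called repeatedly and compares the outcome with np.can_cast. The group name results is illustrative:

```python
import h5py
import numpy as np

with h5py.File("require_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    # Calling require_group twice is safe: the second call returns the same group.
    first = f.require_group("results")
    second = f.require_group("results")
    same_group = first == second

    f.create_dataset("int_dataset", shape=(100,), dtype="int64")

    # int32 can be safely cast to the existing int64, so this succeeds
    # and hands back the existing (int64) dataset.
    existing = f.require_dataset("int_dataset", shape=(100,), dtype="int32")
    existing_dtype = existing.dtype

    # float64 cannot be safely cast to int64, so this is a conflict.
    try:
        f.require_dataset("int_dataset", shape=(100,), dtype="float64")
        conflict = False
    except TypeError:
        conflict = True

# The same checks, done by hand with NumPy's casting rules.
int32_ok = np.can_cast(np.int32, np.int64)
float64_ok = np.can_cast(np.float64, np.int64)

print(same_group, existing_dtype, conflict, int32_ok, float64_ok)
```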

Iteration and Containership

Iteration is a core Python concept, key to writing “Pythonic” code that runs quickly and that your colleagues can understand. It’s also a natural way to explore the contents of groups.

How Groups Are Actually Stored

In the HDF5 file, group members are indexed using a structure called a “B-tree.” This isn’t a computer science text, so we won’t spend too long on the subject, but it’s valuable to have a rough understanding of what’s going on behind the scenes, especially if you’re dealing with groups that have thousands or hundreds of thousands of items.

“B-trees” are data structures that are great for keeping track of large numbers of items, while still making retrieval (and addition) of items fast. They work by taking a collection of items, each of which is orderable according to some scheme like a string name or numeric identifier, and building up a tree-like “index” to rapidly retrieve an item.

For example, if you have an HDF5 group with a single member, and another group with a million members, it doesn’t take a million times as long to open an object in the latter group. Group members are indexed by name, so if you know the name of an object then HDF5 can traverse the index and quickly retrieve the item. The same is true when creating a new group member; HDF5 doesn’t have to “insert” the member into the middle of a big table somewhere, shuffling all the entries around.

Of course, all of this is transparent to the user. Every group in an HDF5 file comes with an index that tracks members in alphabetical order. Keep in mind this means “C-style” alphabetical order (whimsically called “ASCIIbetical” order):

>>> f = h5py.File('iterationdemo.hdf5','w')

>>> f.create_group('1')

>>> f.create_group('2')

>>> f.create_group('10')

>>> f.create_dataset('data', (100,))

>>> f.keys()

[u'1', u'10', u'2', u'data']

Files can also contain other optional indices, for example those that track object creation time, but h5py doesn’t expose them.

This brings us to the first point: h5py will generally iterate over objects in the file in alphabetical order (especially for small groups), but you shouldn’t rely on this behavior. Behind the scenes, HDF5 is actually retrieving objects in so-called native order, which basically means “as fast as possible.” The only thing that’s guaranteed is that if you don’t modify the group, the order will remain the same.
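If your code depends on a particular order, the safest approach is to sort the names yourself. A minimal sketch, assuming Python 3 and a recent h5py:

```python
import h5py

with h5py.File("order_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    for name in ("1", "2", "10"):
        f.create_group(name)
    f.create_dataset("data", shape=(100,))

    # Explicit sort by code point: the "ASCIIbetical" order described above.
    ascii_order = sorted(f)

    # Explicit numeric sort for the members whose names are integers.
    numeric_order = sorted((name for name in f if name.isdigit()), key=int)

print(ascii_order, numeric_order)
```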

Dictionary-Style Iteration

In keeping with the general convention that groups work like dictionaries, iterating over a group in HDF5 provides the names of the members. Remember, these will be supplied as Unicode strings:

>>> [x for x in f]

[u'1', u'10', u'2', u'data']

There are also iterkeys (equivalent to the preceding), itervalues, and iteritems methods, which do just what you’d expect:

>>> [y for y in f.itervalues()]

[<HDF5 group "/1" (0 members)>,

<HDF5 group "/10" (0 members)>,

<HDF5 group "/2" (0 members)>,

<HDF5 dataset "data": shape (100,), type "<f4">]

>>> [(x,y) for x, y in f.iteritems()]

[(u'1', <HDF5 group "/1" (0 members)>),

(u'10', <HDF5 group "/10" (0 members)>),

(u'2', <HDF5 group "/2" (0 members)>),

(u'data', <HDF5 dataset "data": shape (100,), type "<f4">)]

There are also the standard keys, items, and values methods, which produce lists equivalent to the three preceding examples. This brings us to the first performance tip involving iteration and groups: unless you really want to produce a list of the 10,000 objects in your group, use the iter* methods.

NOTE

If you’re using Python 3, you’ll notice that you have only the keys, values, and items methods. That’s OK; like dictionaries, under Python 3 these return iterables, not lists.

Containership Testing

This is another seemingly obvious performance issue that crops up from time to time. If you’re writing code like this, DON’T:

>>> if 'name' in group.keys():

This creates and throws away a list of all your group members every time you use it. By instead using the standard Python containership test, you can leverage the underlying HDF5 index on object names, which will go very, very fast:

>>> if 'name' in group:

Critically, you can also use paths spanning several groups, although it’s very slightly slower since the intermediate groups have to be inspected by HDF5:

>>> if 'some/big/path' in group:

Very handy. Keep in mind that like accessing group members, the POSIX-style “parent directory” symbol ".." won’t work. You won’t even get an error message; HDF5 will look for a group named ".." and determine it’s not present:

>>> '../1' in f['/1']

False

If you’re manipulating POSIX-style strings and run into this problem, consider “normalizing” your paths using the posixpath module:

>>> grp = f['/1']

>>> path = "../1"

>>> import posixpath as pp

>>> path = pp.normpath(pp.join(grp.name, path))

>>> path

u'/1'

>>> path in grp

True
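The normalization step can be wrapped in a small helper. The function contains_normalized below is hypothetical (it is not part of h5py); the sketch assumes Python 3 and a recent h5py:

```python
import posixpath

import h5py


def contains_normalized(group, path):
    """Hypothetical helper: membership test that resolves '..' relative to group."""
    full = posixpath.normpath(posixpath.join(group.name, path))
    # The normalized path is absolute, so test it against the file (root group).
    return full in group.file


with h5py.File("contains_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    f.create_group("1")
    f.create_group("some/big/path")
    grp = f["/1"]

    multi_level = "some/big/path" in f              # path spanning several groups
    via_helper = contains_normalized(grp, "../1")   # resolves to /1
    missing = contains_normalized(grp, "../nope")   # resolves to /nope

print(multi_level, via_helper, missing)
```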

Multilevel Iteration with the Visitor Pattern

Basic iteration works fine for the contents of a single group. But what if you want to iterate over every single object in the file? Or all objects “below” a certain group?

In the HDF5 world, this is accomplished by visitor iteration. Rather than HDF5 supplying you with an iterable, you provide a callable and HDF5 calls it with an argument or two for every object.

Visit by Name

Your entry point is the visit method on the Group class. Let’s create a simple file to test it out:

>>> f = h5py.File('visit_test.hdf5', 'w')

>>> f.create_dataset('top_dataset', data=1.0)

>>> f.create_group( 'top_group_1' )

>>> f.create_group( 'top_group_1/subgroup_1' )

>>> f.create_dataset('top_group_1/subgroup_1/sub_dataset_1', data=1.0)

>>> f.create_group( 'top_group_2' )

>>> f.create_dataset('top_group_2/sub_dataset_2', data=1.0)

We can supply any callable to visit, which takes one argument, the object name:

>>> def printname(name):

...     print name

>>> f.visit(printname)

top_dataset

top_group_1

top_group_1/subgroup_1

top_group_1/subgroup_1/sub_dataset_1

top_group_2

top_group_2/sub_dataset_2

No particular order is guaranteed, except that when visit enters a subgroup, all the members will be visited before moving on to the next subgroup. For example, everything under top_group_1 is listed together, and so is everything under top_group_2.

You’re not required to visit the entire file; visit works just fine on subgroups:

>>> grp = f['top_group_1']

>>> grp.visit(printname)

subgroup_1

subgroup_1/sub_dataset_1

The visitor pattern is a little different from standard Python iteration, but is quite powerful once you get used to it. For example, here’s a simple way to get a list of every single object in the file:

>>> mylist = []

>>> f.visit(mylist.append)

NOTE

As with all object names in the file, the names supplied to visit are “text” strings (unicode on Python 2, str on Python 3). Keep this in mind when writing your callbacks.

Multiple Links and visit

Of course, we know that an HDF5 file is not just a simple tree. Hard links are a great way to share objects between groups. But how do they interact with visit?

Let’s add a hard link to the subgroup we just explored (top_group_1), and run visit again to see what happens:

>>> grp['hardlink'] = f['top_group_2']

>>> grp.visit(printname)

hardlink

hardlink/sub_dataset_2

subgroup_1

subgroup_1/sub_dataset_1

Not bad. The group at /top_group_2 is effectively “mounted” in the file at /top_group_1/hardlink, and visit explores it correctly.

Now let’s try something a little different. We’ll undo that last hard link, and try to trick visit into visiting sub_dataset_1 twice:

>>> del grp['hardlink']

>>> grp['hardlink_to_dataset'] = grp['subgroup_1/sub_dataset_1']

>>> grp.visit(printname)

hardlink_to_dataset

subgroup_1

What happened? We didn’t see sub_dataset_1 in the output this time.

By design, each object in a file will be visited only once, regardless of how many links exist to the object. Among other things, this eliminates the possibility of getting stuck in an endless loop, as might happen if some clever person were to try the following:

>>> f['/root'] = f['/']

There is a trade-off. As we saw in our initial discussion of hard links, there’s no such thing as the “original” or “real” name for an object. So if multiple links point to your dataset, when visit supplies a name it may not be the one you expect.

Visiting Items

Given the name supplied to your callback, you could retrieve the object by simply using __getitem__ on the group you’re iterating over:

>>> def printobj(name):

...     print grp[name]

But that’s a pain; since the name argument supplied by visit is a relative path, your function has to know in advance what group it’ll be applied to. The previous example will work properly only when applied to grp.

Thankfully, HDF5 provides a more general way to deal with this. The method visititems supplies both the relative name and an instance of each object:

>>> def printobj2(name, obj):

...     print name, obj

>>> grp.visititems(printobj2)

hardlink_to_dataset <HDF5 dataset "hardlink_to_dataset": shape (), type "<f8">

subgroup_1 <HDF5 group "/top_group_1/subgroup_1" (1 members)>

Since each object has to be opened, there is some overhead involved. You’re better off using visititems only in the case where you really need access to each object; for example, if you need to inspect attributes.
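To make the trade-off concrete, here is a sketch that uses visit when names are enough and visititems when the object itself must be inspected. It assumes Python 3 and a recent h5py; the callback name collect_datasets is made up for the example:

```python
import h5py

with h5py.File("visit_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("top_dataset", data=1.0)
    f.create_dataset("top_group_1/subgroup_1/sub_dataset_1", data=1.0)
    f.create_dataset("top_group_2/sub_dataset_2", data=1.0)

    # Names only: visit is enough, and no object has to be opened.
    all_names = []
    f.visit(all_names.append)

    # Object needed (here, to tell datasets from groups): use visititems.
    dataset_names = []

    def collect_datasets(name, obj):
        if isinstance(obj, h5py.Dataset):
            dataset_names.append(name)
        # Falling off the end returns None, so iteration continues.

    f.visititems(collect_datasets)

# Sort explicitly, since no particular visiting order is guaranteed.
all_names.sort()
dataset_names.sort()
print(all_names)
print(dataset_names)
```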

One way to make visit a little more generic is by using the built-in Python widget functools.partial. For example, here’s a trivial function that prints the absolute path of each object in the group:

>>> import posixpath

>>> from functools import partial

>>> def print_abspath(somegroup, name):

...     """ Print *name* as an absolute path

...     somegroup: HDF5 base group (*name* is relative to this)

...     name: Object name relative to *somegroup*

...     """

...     print posixpath.join(somegroup.name, name)

>>> grp.visit(partial(print_abspath, grp))

/top_group_1/hardlink_to_dataset

/top_group_1/subgroup_1

Using this technique, you can avoid “embedding” the group you intend to iterate over in the function itself.

Canceling Iteration: A Simple Search Mechanism

There’s a simple way to “bail out” when visiting items. You might notice that our printname function has no explicit return value; in Python that means that the function returns None. If you return anything else, the visit or visititems method will immediately stop and return that value.

Let’s suppose that we want to find a dataset that has an attribute with a particular value:

>>> f['top_group_2/sub_dataset_2'].attrs['special'] = 42

Here’s a function that will find such an object, when supplied to visititems:

>>> def findspecial(name, obj):

...     if obj.attrs.get('special') == 42:

...         return obj

>>> out = f.visititems(findspecial)

>>> out

<HDF5 dataset "sub_dataset_2": shape (), type "<f8">
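A variation on the same search, sketched for Python 3 and a recent h5py, returns the name instead of the object so that the result stays usable after the file is closed. It also shows what comes back when nothing matches:

```python
import h5py

with h5py.File("search_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("top_dataset", data=1.0)
    f.create_dataset("top_group_1/subgroup_1/sub_dataset_1", data=1.0)
    f.create_dataset("top_group_2/sub_dataset_2", data=1.0)
    f["top_group_2/sub_dataset_2"].attrs["special"] = 42

    def find_special(name, obj):
        # Returning anything other than None stops the traversal.
        if obj.attrs.get("special") == 42:
            return name

    def find_nothing(name, obj):
        # Always returns None, so every object is visited.
        return None

    found = f.visititems(find_special)
    not_found = f.visititems(find_nothing)

print(found, not_found)
```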

Copying Objects

HDF5 includes built-in facilities for copying objects from one place to another, freeing you from the tedious job of recursively walking the HDF5 tree, checking for duplicate links, copying over attributes, etc.

Single-File Copying

Let’s create a simple file to test this, with two groups and a dataset:

>>> f = h5py.File('copytest','w')

>>> f.create_group('mygroup')

>>> f.create_group('mygroup/subgroup')

>>> f.create_dataset('mygroup/apples', (100,))

Copying a dataset is straightforward, and results in a brand-new dataset, not a reference or link to the old one:

>>> f.copy('/mygroup/apples', '/oranges')

>>> f['oranges'] == f['mygroup/apples']

False

The great thing about the built-in HDF5 copy() is that it correctly handles recursively copying groups:

>>> f.copy('mygroup', 'mygroup2')

>>> f.visit(printname)

oranges

mygroup

mygroup/apples

mygroup/subgroup

mygroup2

mygroup2/apples

mygroup2/subgroup

You’re not limited to using paths for source and destination. If you already have an open Dataset object, for example, you can copy it to a Group or File object:

>>> dset = f['/mygroup/apples']

>>> f.copy(dset, f)

>>> f.visit(printname)

apples

oranges

mygroup

mygroup/apples

mygroup/subgroup

mygroup2

mygroup2/apples

mygroup2/subgroup

Since the destination is a group, the dataset is created with its “base name” of apples, analogous to how files are copied with the UNIX cp command.

There’s no requirement that the source and destination be the same file. This is one of the advantages of using File or Group objects instead of paths; the corresponding objects will be copied regardless of which file they reside in. If you’re trying to write generic code, it’s good to keep this in mind.
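A cross-file copy might look like the following sketch, which assumes Python 3 and a recent h5py. Both files are in-memory stand-ins with made-up names, and the name= keyword is used to pick the destination name for the group copy:

```python
import h5py

src = h5py.File("copy_src_sketch.hdf5", "w", driver="core", backing_store=False)
dst = h5py.File("copy_dst_sketch.hdf5", "w", driver="core", backing_store=False)

src.create_group("mygroup/subgroup")
src.create_dataset("mygroup/apples", data=[1.0, 2.0, 3.0])

# Copy an open Dataset object into another file; it keeps its base name.
src.copy(src["mygroup/apples"], dst)

# Copy a whole group recursively, choosing a new name at the destination.
src.copy("mygroup", dst, name="mygroup_copy")

copied_names = []
dst.visit(copied_names.append)
copied_names.sort()

copied_values = dst["apples"][:].tolist()
is_same_object = dst["apples"] == src["mygroup/apples"]

src.close()
dst.close()

print(copied_names, copied_values, is_same_object)
```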

Object Comparison and Hashing

Let’s take a break from links and iteration to discuss a more subtle aspect of how HDF5 behaves. In lots of the preceding examples, we used Python’s equality operator to see if two groups are “the same thing”:

>>> f = h5py.File('objectdemo.hdf5','w')

>>> grpx = f.create_group('x')

>>> grpy = f.create_group('y')

>>> grpx == f['x']

True

>>> grpx == grpy

False

If we investigate further, we discover that this kind of equality testing is independent of whether the Python objects are one and the same:

>>> id(grpx) # Uniquely identifies the Python object "grpx"

73399280

>>> id(f['x'])

73966416

In h5py, equality testing uses the low-level HDF5 facilities to determine which references (identifiers, in the HDF5 lingo) point to the same groups or datasets on disk. This information is also used to compute the hash of an object, which means you can safely use Group, File, and Dataset objects as dictionary keys or as the members of sets:

>>> hash(grpx)

587327447

>>> hash(f['x'])

587327447

There’s one more wrinkle for equality testing, and you may bump into it when using the .file property on objects: File and Group instances will compare equal when the Group instance represents the root group:

>>> f == f['/']

True

This is a consequence of the “double duty” File instances perform, representing both your file on disk and also the root group on the HDF5 side.

Finally, the truth value of an HDF5 object lets you see whether it’s alive or dead:

>>> bool(grpx)

True

>>> f.close()

>>> grpx

<Closed HDF5 group>

>>> bool(grpx)

False
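As a sketch of what equality and hashing buy you in practice (assuming Python 3 and a recent h5py, with an in-memory file and illustrative names), several handles to the same group collapse to a single entry in a set or dictionary:

```python
import h5py

f = h5py.File("object_sketch.hdf5", "w", driver="core", backing_store=False)
grpx = f.create_group("x")
grpy = f.create_group("y")

# A second handle to the same group: a different Python object, but equal.
other_handle = f["x"]
distinct_python_objects = grpx is not other_handle
equal = grpx == other_handle
same_hash = hash(grpx) == hash(other_handle)

# Sets and dictionaries treat both handles as one key.
unique_count = len({grpx, other_handle, grpy})
labels = {grpx: "first group"}
label_via_other_handle = labels[other_handle]

# File doubles as the root group.
root_equal = f == f["/"]

# Truth value tracks whether the underlying HDF5 object is still open.
alive_before = bool(grpx)
f.close()
alive_after = bool(grpx)

print(equal, same_hash, unique_count, label_via_other_handle,
      root_equal, alive_before, alive_after)
```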

Next, we discuss one thing that makes HDF5 so useful in real-world science: the ability to store data and metadata together side by side using attributes.