h5py example: Organizing Data and Metadata, Coping with Large Data Volumes

Organizing Data and Metadata

Suppose we have a NumPy array that represents some data from an experiment:

>>> import numpy as np 
>>> temperature = np.random.random(1024)
>>> temperature
array([ 0.44149738, 0.7407523 , 0.44243584, ..., 0.19018119, 0.64844851, 0.55660748])

Let’s also imagine that these data points were recorded from a weather station that sampled the temperature, say, every 10 seconds. In order to make sense of the data, we have to record that sampling interval, or “delta-T,” somewhere. For now we’ll put it in a Python variable:

>>> dt = 10.0

The data acquisition started at a particular time, which we will also need to record. And of course, we have to know that the data came from Weather Station 15:

>>> start_time = 1375204299  # in Unix time 
>>> station = 15

We could use the built-in NumPy function np.savez to store these values on disk. This simple function saves the values as NumPy arrays, packed together in a ZIP file with associated names:

>>> np.savez("weather.npz", data=temperature, start_time=start_time, station=station)

We can get the values back from the file with np.load:

>>> out = np.load("weather.npz") 
>>> out["data"]
array([ 0.44149738, 0.7407523 , 0.44243584, ..., 0.19018119, 0.64844851, 0.55660748])
>>> out["start_time"]
array(1375204299)
>>> out["station"]
array(15)

So far so good. But what if we have more than one quantity per station? Say there’s also wind speed data to record?

>>> wind = np.random.random(2048) 
>>> dt_wind = 5.0 # Wind sampled every 5 seconds

And suppose we have multiple stations. We could introduce some kind of naming convention, I suppose: “wind_15” for the wind values from station 15, and things like “dt_wind_15” for the sampling interval. Or we could use multiple files…
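To see why this gets unwieldy, here is a rough sketch of what that flat naming convention might look like (the key names such as wind_15 and dt_wind_15 are hypothetical, invented just for illustration):

>>> np.savez("weather.npz",
...          temperature_15=temperature, dt_15=10.0, start_time_15=start_time,
...          wind_15=wind, dt_wind_15=5.0)

Every new station or quantity adds another crop of keys, and nothing ties a piece of metadata like dt_wind_15 to the data it describes except the naming scheme itself.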

In contrast, here’s how this application might approach storage with HDF5:

>>> import h5py 
>>> f = h5py.File("weather.hdf5", "a")  # read/write, created if it doesn't exist
>>> f["/15/temperature"] = temperature
>>> f["/15/temperature"].attrs["dt"] = 10.0
>>> f["/15/temperature"].attrs["start_time"] = 1375204299
>>> f["/15/wind"] = wind
>>> f["/15/wind"].attrs["dt"] = 5.0

>>> f["/20/temperature"] = temperature_from_station_20

(and so on)

This example illustrates two of the “killer features” of HDF5: organization in hierarchical groups and attributes. Groups, like folders in a filesystem, let you store related datasets together. In this case, temperature and wind measurements from the same weather station are stored together under groups named “/15,” “/20,” etc. Attributes let you attach descriptive metadata directly to the data it describes. So if you give this file to a colleague, she can easily discover the information needed to make sense of the data:

>>> dataset = f["/15/temperature"] 
>>> for key, value in dataset.attrs.items():
...     print("%s: %s" % (key, value))
dt: 10.0
start_time: 1375204299
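
Because groups behave like Python dictionaries, that same colleague can also discover the file's layout without being told anything in advance. Assuming the file contains just the entries written above, a quick sketch of browsing it might look like this:

>>> list(f.keys())            # one top-level group per station
['15', '20']
>>> list(f["/15"].keys())     # quantities recorded at station 15
['temperature', 'wind']
>>> f.visit(print)            # walk the whole hierarchy
15
15/temperature
15/wind
20
20/temperature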

Coping with Large Data Volumes

As a high-level “glue” language, Python is increasingly being used for rapid visualization of big datasets and to coordinate large-scale computations that run in compiled languages like C and FORTRAN. It’s now relatively common to deal with datasets hundreds of gigabytes or even terabytes in size; HDF5 itself can scale up to exabytes.

On all but the biggest machines, it’s not feasible to load such datasets directly into memory. One of HDF5’s greatest strengths is its support for subsetting and partial I/O. For example, let’s take the 1024-element “temperature” dataset we created earlier:

>>> dataset = f["/15/temperature"]

Here, the object named dataset is a proxy object representing an HDF5 dataset. It supports array-like slicing operations, which will be familiar to frequent NumPy users:

>>> dataset[0:10]
array([ 0.44149738, 0.7407523 , 0.44243584, 0.3100173 , 0.04552416,
        0.43933469, 0.28550775, 0.76152561, 0.79451732, 0.32603454])
>>> dataset[0:10:2]
array([ 0.44149738, 0.44243584, 0.04552416, 0.28550775, 0.79451732])

Keep in mind that the actual data lives on disk; when slicing is applied to an HDF5 dataset, the appropriate data is found and loaded into memory. Slicing in this fashion leverages the underlying subsetting capabilities of HDF5 and is consequently very fast.
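
If you want to skip even the temporary array that a slice returns, h5py can read a selection directly into a preallocated NumPy array via Dataset.read_direct. The destination array out and the selection below are just illustrative choices:

>>> out = np.empty(10, dtype=dataset.dtype)
>>> dataset.read_direct(out, source_sel=np.s_[0:10])  # fill `out` in place
>>> out[:3]
array([ 0.44149738, 0.7407523 , 0.44243584])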

Another great thing about HDF5 is that you have control over how storage is allocated. For example, except for some metadata, a brand new dataset takes zero space, and by default bytes are only used on disk to hold the data you actually write.

Here’s a 2-terabyte dataset you can create on just about any computer:

>>> big_dataset = f.create_dataset("big", shape=(1024, 1024, 1024, 512), dtype='float32')
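
A quick sanity check on that “2-terabyte” figure: the logical size is just the product of the shape and the 4 bytes per float32 element:

>>> big_dataset.shape
(1024, 1024, 1024, 512)
>>> 1024 * 1024 * 1024 * 512 * 4   # bytes: exactly 2 TiB
2199023255552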

Although no storage is yet allocated, the entire “space” of the dataset is available to us. We can write anywhere in the dataset, and only the bytes on disk necessary to hold the data are used:

>>> big_dataset[344, 678, 23, 36] = 42.0

When storage is at a premium, you can even use transparent compression on a dataset-by-dataset basis:

>>> compressed_dataset = f.create_dataset("comp", shape=(1024,), dtype='int32', compression='gzip') 
>>> compressed_dataset[:] = np.arange(1024)
>>> compressed_dataset[:]
array([ 0, 1, 2, ..., 1021, 1022, 1023])
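
The compression settings travel with the dataset, so they can be inspected later through standard h5py dataset properties:

>>> compressed_dataset.compression
'gzip'
>>> compressed_dataset.compression_opts   # gzip level; h5py's default is 4
4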
