Dataset

A Dataset is analogous to a file in an operating system and is contained within a Group.

A Dataset is essentially a numpy.ndarray with Metadata and it can be accessed in read-only mode.

Since a Dataset can be thought of as a numpy.ndarray, the attributes of a numpy.ndarray are also valid for a Dataset. For example, suppose my_dataset is a Dataset

>>> my_dataset
<Dataset '/my_dataset' shape=(5,) dtype='|V16' (2 metadata)>
>>> my_dataset.data
array([(0.23, 1.27), (1.86, 2.74), (3.44, 2.91), (5.91, 1.83), (8.73, 0.74)],
      dtype=[('x', '<f8'), ('y', '<f8')])

You can get the numpy.ndarray.shape using

>>> my_dataset.shape
(5,)

or convert the data in the Dataset to a Python list, using numpy.ndarray.tolist()

>>> my_dataset.tolist()
[(0.23, 1.27), (1.86, 2.74), (3.44, 2.91), (5.91, 1.83), (8.73, 0.74)]

To access the Metadata of a Dataset, use the metadata attribute

>>> my_dataset.metadata
<Metadata '/my_dataset' {'temperature': 20.13, 'humidity': 45.31}>

You can access values of the Metadata as attributes

>>> my_dataset.metadata.temperature
20.13

or as keys

>>> my_dataset.metadata['humidity']
45.31

If the numpy.dtype that was used to create the underlying numpy.ndarray for the Dataset defines field names, the fields can be accessed either as keys or as attributes. For example, you can access the fields in my_dataset as keys

>>> my_dataset['x']
array([0.23, 1.86, 3.44, 5.91, 8.73])

or as attributes

>>> my_dataset.x
array([0.23, 1.86, 3.44, 5.91, 8.73])

Note that the returned object is a numpy.ndarray and therefore does not contain any Metadata.

See Accessing Keys as Class Attributes for more information.

You can also chain multiple attribute calls together. For example, to get the maximum x value in my_dataset you can use

>>> my_dataset.x.max()
8.73

Slicing and Indexing

Slicing and indexing a Dataset are valid operations, but the returned object is a NumPy object that does not contain any Metadata.

Consider my_dataset from above. One can slice it

>>> my_dataset[::2]
array([(0.23, 1.27), (3.44, 2.91), (8.73, 0.74)],
      dtype=[('x', '<f8'), ('y', '<f8')])

or index it

>>> my_dataset[2]
(3.44, 2.91)

Since a numpy.ndarray is returned, you are responsible for keeping track of the Metadata in slicing and indexing operations. For example,

>>> my_subset = root.create_dataset('my_subset', data=my_dataset[::2], **my_dataset.metadata)
>>> my_subset
<Dataset '/my_subset' shape=(3,) dtype='|V16' (2 metadata)>
>>> my_subset.data
array([(0.23, 1.27), (3.44, 2.91), (8.73, 0.74)],
      dtype=[('x', '<f8'), ('y', '<f8')])
>>> my_subset.metadata
<Metadata '/my_subset' {'temperature': 20.13, 'humidity': 45.31}>
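
If the same bookkeeping is needed in several places, it may help to keep the sliced data and a copy of the metadata together until a new Dataset is created. The sketch below uses NumPy and a plain dict only; slice_with_metadata is a hypothetical helper, not part of the library, and the dict stands in for the Metadata of my_dataset.

```python
import numpy as np

# Stand-ins for my_dataset.data and my_dataset.metadata from above.
data = np.array(
    [(0.23, 1.27), (1.86, 2.74), (3.44, 2.91), (5.91, 1.83), (8.73, 0.74)],
    dtype=[('x', '<f8'), ('y', '<f8')],
)
metadata = {'temperature': 20.13, 'humidity': 45.31}


def slice_with_metadata(array, meta, index):
    """Hypothetical helper: return the sliced data with a copy of the metadata."""
    subset = np.array(array[index])  # copy, so the subset does not share memory
    return subset, dict(meta)        # shallow copy of the metadata


subset, subset_meta = slice_with_metadata(data, metadata, slice(None, None, 2))
```

The returned pair could then be passed on in the same way as in the example above, for instance as data=subset and **subset_meta.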

Arithmetic Operations

Arithmetic operations are valid with a Dataset. However, the returned object is a numpy.ndarray, so the Metadata of the Datasets that are involved in the operation is not included in the returned object.

For example, suppose you have two Datasets that contain the following information

>>> dset1
<Dataset '/dset1' shape=(3,) dtype='<f8' (1 metadata)>
>>> dset1.data
array([1., 2., 3.])
>>> dset1.metadata
<Metadata '/dset1' {'temperature': 20.3}>

>>> dset2
<Dataset '/dset2' shape=(3,) dtype='<f8' (1 metadata)>
>>> dset2.data
array([4., 5., 6.])
>>> dset2.metadata
<Metadata '/dset2' {'temperature': 21.7}>

You can directly add the Datasets, but the temperature values in Metadata are not included in the returned object

>>> dset3 = dset1 + dset2
>>> dset3
array([5., 7., 9.])
>>> dset3.metadata
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'metadata'

You are responsible for keeping track of the Metadata in arithmetic operations, for example,

>>> temperatures = {'t1': dset1.metadata.temperature, 't2': dset2.metadata.temperature}
>>> dset3 = root.create_dataset('dset3', data=dset1+dset2, temperatures=temperatures)
>>> dset3
<Dataset '/dset3' shape=(3,) dtype='<f8' (1 metadata)>
>>> dset3.data
array([5., 7., 9.])
>>> dset3.metadata
<Metadata '/dset3' {'temperatures': {'t1': 20.3, 't2': 21.7}}>
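
The same idea can be written as a small function that performs the addition and collects the temperatures in one step. This is a sketch with NumPy arrays and plain dicts standing in for the two Datasets and their Metadata; add_keeping_temperatures is a hypothetical name and is not provided by the library.

```python
import numpy as np

# Stand-ins for dset1 and dset2 (data and metadata) from above.
data1, meta1 = np.array([1., 2., 3.]), {'temperature': 20.3}
data2, meta2 = np.array([4., 5., 6.]), {'temperature': 21.7}


def add_keeping_temperatures(a, a_meta, b, b_meta):
    """Hypothetical helper: add two arrays and record both temperatures."""
    total = a + b  # a plain numpy.ndarray, without any metadata
    temperatures = {'t1': a_meta['temperature'], 't2': b_meta['temperature']}
    return total, {'temperatures': temperatures}


total, total_meta = add_keeping_temperatures(data1, meta1, data2, meta2)
```

How the metadata of the operands should be combined (kept side by side as here, averaged, or dropped) depends on the measurement, which is why the choice is left to the caller.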

A Dataset for Logging Records

The DatasetLogging class is a custom Dataset that is also a logging Handler, so logging records are automatically appended to the Dataset. See create_dataset_logging() for more details.

When a file is read(), an object that was once a DatasetLogging is loaded as a Dataset. To convert the Dataset back to a DatasetLogging object, so that logging records are once again appended to it, call the require_dataset_logging() method with the name argument equal to the name of the Dataset.
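
The Handler side of this design can be illustrated with the standard logging module alone. The sketch below is not DatasetLogging: ListHandler is a hypothetical class that appends each record to a Python list, whereas DatasetLogging appends the records to the Dataset. The logger name is also made up for the example.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that stores every emitted record in a list."""

    def __init__(self, level=logging.NOTSET):
        super().__init__(level=level)
        self.records = []

    def emit(self, record):
        # Called by the logging machinery for each record that passes the level checks.
        self.records.append((record.levelname, record.name, record.getMessage()))


logger = logging.getLogger('dataset_logging_sketch')
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example output away from the root logger

handler = ListHandler()
logger.addHandler(handler)

logger.info('temperature is %.2f', 20.13)  # stored by the handler
logger.debug('below the logger level')     # ignored, DEBUG < INFO

logger.removeHandler(handler)  # records are no longer appended after this
```

Removing the handler is the analogue of the situation after a file is read: the stored records remain available, but new records are only appended once a handler is attached again.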