
from_hdf5 function that uses filenames rather than files

See original GitHub issue

The current approach to loading data from HDF5 is to open an h5py or netCDF4 Dataset object and pass it to from_array. This is efficient, but it also embeds the file pointer within the dask graph. That fails if you need to serialize the graph, as is necessary in distributed computing.
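The failure mode can be demonstrated without h5py at all, since it is the open file handle itself that cannot travel: below, a plain Python file object stands in for the h5py Dataset that ends up inside the graph (the file path and contents are placeholders for illustration).

```python
import os
import pickle
import tempfile

# Create a throwaway file to open; the contents don't matter for the demo.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with open(path, "wb") as f:
    f.write(b"placeholder")

# Stand-in for h5py.File(...)[datapath] kept inside a dask graph:
handle = open(path, "rb")

try:
    pickle.dumps(handle)        # what distributed scheduling requires
    serializable = True
except TypeError:
    serializable = False        # open handles cannot be pickled
```

An h5py Dataset fails the same way, which is why the graph must carry only metadata describing how to reopen the file.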

It might be wise to instead create an explicit da.from_hdf5 function that stores only the filename, datapath, and slice information within the graph.

It would also be nice if this API were exposed up to xarray (cc @shoyer).

cc @rabernat @pwolfram
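The proposal above can be sketched as a task graph that carries only picklable metadata, with the file opened inside each task. This is a minimal sketch, not dask's actual API: `read_block`, `from_hdf5_graph`, and the `("x", i)` key layout are illustrative names, and `'foo.h5'` need not exist since no task is executed here.

```python
import pickle

def read_block(filename, datapath, index):
    """Runs on a worker: open the file, read one block, let it close."""
    import h5py                                # imported only where it runs
    with h5py.File(filename, "r") as f:
        return f[datapath][index]

def from_hdf5_graph(filename, datapath, chunks):
    """Build a dask-style task graph holding only filename/datapath/slice."""
    return {("x", i): (read_block, filename, datapath, index)
            for i, index in enumerate(chunks)}

dsk = from_hdf5_graph("foo.h5", "/x/y/z1",
                      [slice(0, 100), slice(100, 200)])

# Unlike a graph containing an open Dataset, this one serializes fine:
blob = pickle.dumps(dsk)
```

Because every task value is a tuple of a function, strings, and slices, the graph can be shipped to remote workers without any file pointer crossing process boundaries.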

Issue Analytics

  • State: open
  • Created: 8 years ago
  • Reactions: 1
  • Comments: 27 (21 by maintainers)

Top GitHub Comments

2 reactions
DaanVanVugt commented, Nov 9, 2017

@mrocklin I took your hdf5lazy approach a bit further by wrapping all h5py objects to support __{set,get}state__ and to remember how the file was opened (the mode is especially important). An LRU cache is also implemented. This lets me replace import h5py with import h5pickle as h5py and get working serialization in the dask distributed scheduler.

See the code at https://github.com/Exteris/h5pickle

This works very well for me in dask. Would you mind taking a look at the code? Perhaps this could be useful in distributed.

ps. Thanks for making dask so awesome
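The core of the wrapping idea described above can be sketched in a few lines: remember how the file was opened, serialize only that, and reopen on unpickling. To keep the sketch dependency-free, plain open() stands in for h5py.File; h5pickle wraps Group and Dataset objects the same way and layers an LRU cache of open files on top. The class name and demo file are illustrative, not h5pickle's actual code.

```python
import os
import pickle
import tempfile

class PicklableFile:
    """Remember (path, mode) so the handle can be reopened after unpickling."""

    def __init__(self, path, mode="r"):
        self.path, self.mode = path, mode      # remembered for __setstate__
        self._f = open(path, mode)             # h5pickle opens h5py.File here

    def read(self, *args):
        return self._f.read(*args)

    def __getstate__(self):
        # Serialize only how to reopen the file, never the handle itself.
        return {"path": self.path, "mode": self.mode}

    def __setstate__(self, state):
        self.__init__(state["path"], state["mode"])

# Demo: a pickle round-trip reopens the file instead of raising TypeError.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("payload")

clone = pickle.loads(pickle.dumps(PicklableFile(path)))
```

The essential trade-off is that deserialization assumes the same path is reachable on the receiving worker, which holds on a shared filesystem but not otherwise.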

1 reaction
jcrist commented, Jan 18, 2016

A few ideas:

  • Expanding your idea above, the signature could be da.from_hdf5(filepath, datapath). The slice is composed later, as it feels more natural to index (also, : only works inside __getitem__). Example:

```python
z1 = da.from_hdf5('foo.h5', '/x/y/z1')[:, :100]
z2 = da.from_hdf5('foo.h5', '/x/y/z2')[:, :100]
```

  • Another approach would be to make a “lazy” version of the h5py File object that doesn’t open the file pointer until explicitly asked to (or that closes and reopens the file pointer when serialized). Example:

```python
x = da.from_hdf5('foo.h5')
z1 = x['/x/y/z1'][:, :100]
z2 = x['/x/y/z2'][:, :100]
```

