
from_hdf5 function that uses filenames rather than files

See original GitHub issue

The current approach to loading data from HDF5 is to open an h5py or netCDF4 Dataset object and pass it to from_array. This is efficient, but it also embeds the file pointer within the dask graph. That fails if you need to serialize the graph, as is necessary in distributed computing.
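The failure mode can be demonstrated without h5py at all, since it is the open file handle itself that cannot travel: below, a plain Python file object stands in for the h5py Dataset that ends up inside the graph (the file path and contents are placeholders for illustration).

```python
import os
import pickle
import tempfile

# Create a throwaway file to open; the contents don't matter for the demo.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with open(path, "wb") as f:
    f.write(b"placeholder")

# Stand-in for h5py.File(...)[datapath] kept inside a dask graph:
handle = open(path, "rb")

try:
    pickle.dumps(handle)        # what distributed scheduling requires
    serializable = True
except TypeError:
    serializable = False        # open handles cannot be pickled
```

An h5py Dataset fails the same way, which is why the graph must carry only metadata describing how to reopen the file.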

It might be wise to instead create an explicit da.from_hdf5 function that stores only the filename, datapath, and slice information within the graph.

It would also be nice if this API were exposed up to xarray (cc @shoyer).

cc @rabernat @pwolfram
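The proposal above can be sketched as a task graph that carries only picklable metadata, with the file opened inside each task. This is a minimal sketch, not dask's actual API: `read_block`, `from_hdf5_graph`, and the `("x", i)` key layout are illustrative names, and `'foo.h5'` need not exist since no task is executed here.

```python
import pickle

def read_block(filename, datapath, index):
    """Runs on a worker: open the file, read one block, let it close."""
    import h5py                                # imported only where it runs
    with h5py.File(filename, "r") as f:
        return f[datapath][index]

def from_hdf5_graph(filename, datapath, chunks):
    """Build a dask-style task graph holding only filename/datapath/slice."""
    return {("x", i): (read_block, filename, datapath, index)
            for i, index in enumerate(chunks)}

dsk = from_hdf5_graph("foo.h5", "/x/y/z1",
                      [slice(0, 100), slice(100, 200)])

# Unlike a graph containing an open Dataset, this one serializes fine:
blob = pickle.dumps(dsk)
```

Because every task value is a tuple of a function, strings, and slices, the graph can be shipped to remote workers without any file pointer crossing process boundaries.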

Issue Analytics

  • State: open
  • Created: 8 years ago
  • Reactions: 1
  • Comments: 27 (21 by maintainers)

Top GitHub Comments

2 reactions
DaanVanVugt commented, Nov 9, 2017

@mrocklin I took your hdf5lazy approach a bit further by wrapping all h5py objects to support __{set,get}state__ and to remember how the file was opened (the mode is especially important). An LRU cache is also implemented. This lets me replace import h5py with import h5pickle as h5py and get working serialization in the dask distributed scheduler.

See the code at https://github.com/Exteris/h5pickle

This works very well for me in dask. Would you mind taking a look at the code? Perhaps this could be useful in distributed.

ps. Thanks for making dask so awesome
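The core of the wrapping idea described above can be sketched in a few lines: remember how the file was opened, serialize only that, and reopen on unpickling. To keep the sketch dependency-free, plain open() stands in for h5py.File; h5pickle wraps Group and Dataset objects the same way and layers an LRU cache of open files on top. The class name and demo file are illustrative, not h5pickle's actual code.

```python
import os
import pickle
import tempfile

class PicklableFile:
    """Remember (path, mode) so the handle can be reopened after unpickling."""

    def __init__(self, path, mode="r"):
        self.path, self.mode = path, mode      # remembered for __setstate__
        self._f = open(path, mode)             # h5pickle opens h5py.File here

    def read(self, *args):
        return self._f.read(*args)

    def __getstate__(self):
        # Serialize only how to reopen the file, never the handle itself.
        return {"path": self.path, "mode": self.mode}

    def __setstate__(self, state):
        self.__init__(state["path"], state["mode"])

# Demo: a pickle round-trip reopens the file instead of raising TypeError.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("payload")

clone = pickle.loads(pickle.dumps(PicklableFile(path)))
```

The essential trade-off is that deserialization assumes the same path is reachable on the receiving worker, which holds on a shared filesystem but not otherwise.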

1 reaction
jcrist commented, Jan 18, 2016

A few ideas:

  • Expanding your idea above, the signature could be da.from_hdf5(filepath, datapath). The slice is composed later, as it feels more natural to index (also, : only works inside __getitem__). Example:

```python
z1 = da.from_hdf5('foo.h5', '/x/y/z1')[:, :100]
z2 = da.from_hdf5('foo.h5', '/x/y/z2')[:, :100]
```

  • Another approach would be to make a “lazy” version of the h5py File object that doesn’t open the file pointer until explicitly asked to (or that closes and reopens the file pointer when serialized). Example:

```python
x = da.from_hdf5('foo.h5')
z1 = x['/x/y/z1'][:, :100]
z2 = x['/x/y/z2'][:, :100]
```

