from_hdf5 function that uses filenames rather than files
See original GitHub issue
The current approach to load data from HDF5 is to open an h5py or netcdf4 Dataset object and pass it to from_array. This is efficient, but it also embeds the file pointer within the dask graph. This fails if you need to serialize the graph, as is necessary in distributed computing.
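A minimal sketch of the failure mode (the file name data.h5 and dataset path /x are hypothetical examples):

```python
import pickle

import h5py
import dask.array as da

f = h5py.File("data.h5", mode="r")           # hypothetical example file
x = da.from_array(f["/x"], chunks=(1000,))   # graph now holds the live h5py Dataset

pickle.dumps(x)  # fails: h5py objects are not picklable
```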
It might be wise to instead create an explicit da.from_hdf5 function that only stores the filename, datapath, and slice information within the graph.
It would also be nice if this API were exposed up to xarray (cc @shoyer).
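One way such a function could work, as a rough sketch for a 1-D dataset (from_hdf5 and _read_slice below are hypothetical names, not dask's actual implementation):

```python
import dask
import dask.array as da
import h5py

def _read_slice(filename, datapath, sl):
    # Open the file by name inside the task, so only strings and
    # slice objects (not file handles) live in the serialized graph.
    with h5py.File(filename, mode="r") as f:
        return f[datapath][sl]

def from_hdf5(filename, datapath, chunksize):
    # Read shape and dtype once up front, then close the file again.
    with h5py.File(filename, mode="r") as f:
        n = f[datapath].shape[0]
        dtype = f[datapath].dtype
    blocks = [
        da.from_delayed(
            dask.delayed(_read_slice)(
                filename, datapath, slice(i, min(i + chunksize, n))
            ),
            shape=(min(i + chunksize, n) - i,),
            dtype=dtype,
        )
        for i in range(0, n, chunksize)
    ]
    return da.concatenate(blocks)
```

Because every task reopens the file by name, the resulting graph contains only plain Python objects and pickles cleanly.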
Issue Analytics
- State:
- Created 8 years ago
- Reactions: 1
- Comments: 27 (21 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@mrocklin I took your hdf5lazy approach a bit further by wrapping all h5py objects to support __{set,get}state__ and to remember how we opened the file (the mode in particular is important). An LRU cache is also implemented. This allows me to replace import h5py with import h5pickle as h5py and have working serialization in the dask distributed scheduler. See the code at https://github.com/Exteris/h5pickle
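A simplified sketch of the wrapping idea described above (h5pickle itself also wraps Group and Dataset objects and caches open files; this condensed version uses __reduce__, which covers the same serialization hook as __getstate__/__setstate__):

```python
import h5py

def _reopen(args, kwargs):
    # Module-level helper so pickle can locate it by name.
    return File(*args, **kwargs)

class File(h5py.File):
    """Remember how the file was opened so the handle can be
    re-created after unpickling, instead of shipping a live handle."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._init_args = (args, kwargs)

    def __reduce__(self):
        # Serialize only the constructor arguments (filename, mode, ...).
        return (_reopen, self._init_args)
```

With a drop-in wrapper like this, a worker that unpickles the object simply reopens the file locally.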
This works very well for me in dask. Would you mind taking a look at the code? Perhaps this could be useful in distributed.
ps. Thanks for making dask so awesome
A few ideas:
da.from_hdf5(filepath, datapath). The slice is composed later, since it feels more natural to index separately (also, the bare : syntax only works inside __getitem__). Example:
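The comment's example appears truncated in this capture; usage along the lines described might look like this (da.from_hdf5 is the proposed, not an existing, function):

```python
import dask.array as da

# Only the filename and dataset path go into the graph
x = da.from_hdf5("data.h5", "/group/dataset")

# The slice is composed afterwards, with normal indexing
y = x[:1000:2].mean()
y.compute()
```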