Using the Zarr library to read HDF5
The USGS contracted The HDF Group to do a test:
Could we make the HDF5 format as performant on the cloud as the Zarr format by writing the HDF5 chunk locations into .zmetadata and then having the Zarr library read those chunks instead of Zarr-format chunks?
From our first test the answer appears to be YES: https://gist.github.com/rsignell-usgs/3cbe15670bc2be05980dec7c5947b540
We modified both the zarr and xarray libraries to make that notebook possible, adding the FileChunkStore concept. The modified libraries are: https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/zarr-hdf5/binder/environment.yml#L20-L21
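As a rough illustration of the idea (not the exact schema used in the modified libraries; the key and field names below are hypothetical), the extra metadata is essentially a map from each Zarr chunk key to the byte range of the corresponding chunk inside the original HDF5 file:

```python
# Hypothetical sketch: each Zarr chunk key points at the byte range of the
# matching HDF5 chunk in the original .h5 file, so a store like the
# FileChunkStore in the modified zarr fork can serve chunk reads with ranged
# requests against the HDF5 object instead of separate Zarr chunk objects.
chunk_locations = {
    "zeta/0.0": {"offset": 4096,    "size": 2097152},
    "zeta/0.1": {"offset": 2101248, "size": 2097152},
    # ... one entry per HDF5 chunk
}
```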
Feel free to try running the notebook yourself:
(If you run into a "stream is closed" error when computing the max of the Zarr data, just run the cell again; I'm still trying to figure out why that error occurs sometimes.)

There is also this now, which could help: https://github.com/fsspec/kerchunk
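For context, a hedged sketch of the kerchunk approach: scan the HDF5 file's chunk byte ranges once, save them as a reference set, then let Zarr/xarray read directly from the original file. The S3 URL and variable names here are hypothetical placeholders.

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/example.h5"  # hypothetical location

# Scan the HDF5 metadata and record each chunk's offset/length in the file.
with fsspec.open(url, mode="rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Open the references as a "virtual" Zarr store; chunk reads become ranged
# GETs against the original HDF5 file.
fs = fsspec.filesystem(
    "reference", fo=refs, remote_protocol="s3", remote_options={"anon": True}
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
print(ds)
```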
I just read the notebook @ajelenak linked to. This makes it clearer. When the Python file-like object from fsspec is passed to h5py.File, it doesn't read the entire file; it knows to parse only the specific byte ranges it needs to get all the metadata. Even though it makes a ton of requests, it won't download the entire file, which is what I was worried about. So in theory you don't need to make the .zmetadata file at all—you could generate that information on the fly from an h5py.File object—but for the best performance and the fewest HTTP requests (as @ajelenak pointed out), a .zmetadata file should be created before processing. Correct me if I'm wrong.
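For example, something along these lines (URL and dataset name are hypothetical) only pulls the byte ranges h5py asks for while parsing metadata, not the whole file:

```python
import fsspec
import h5py

url = "s3://example-bucket/example.h5"  # hypothetical location

# fsspec hands h5py a file-like object that issues ranged reads on demand,
# so only the metadata byte ranges are fetched here, not the full file.
with fsspec.open(url, mode="rb", anon=True) as f:
    with h5py.File(f, mode="r") as h5:
        print(list(h5.keys()))           # group/dataset names from metadata
        dset = h5["zeta"]                # hypothetical dataset name
        print(dset.shape, dset.chunks)   # shape/chunking read from metadata only
```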