Using the Zarr library to read HDF5
The USGS contracted the HDFGroup to do a test:
Could we make the HDF5 format as performant on the cloud as the Zarr format by writing the HDF5 chunk byte locations into .zmetadata, and then having the Zarr library read those chunks directly from the HDF5 file instead of Zarr-format chunks?
From our first test the answer appears to be YES: https://gist.github.com/rsignell-usgs/3cbe15670bc2be05980dec7c5947b540
We modified both the zarr and xarray libraries to make that notebook possible, adding the FileChunkStore concept. The modified libraries are pinned here: https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/zarr-hdf5/binder/environment.yml#L20-L21
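The core idea behind the FileChunkStore is that a Zarr store only has to answer chunk-key lookups with bytes, and those bytes can just as well come from byte ranges inside the original HDF5 file. Below is a conceptual sketch of that idea; the class name, chunk-index format, and read-only behaviour are assumptions for illustration, not the API of the modified zarr fork:

```python
# Conceptual sketch only -- not the actual FileChunkStore from the modified zarr.
# A Zarr-style store that serves chunk keys by reading the corresponding
# byte ranges straight out of the original HDF5 file.
from collections.abc import MutableMapping


class ByteRangeChunkStore(MutableMapping):
    """Serve Zarr chunk keys from (offset, length) byte ranges in another file."""

    def __init__(self, fileobj, chunk_index):
        # fileobj: an open, seekable file-like object (e.g. from fsspec)
        # chunk_index: dict mapping chunk keys like "water_level/0.0" -> (offset, length),
        # i.e. the kind of information recorded alongside .zmetadata
        self.fileobj = fileobj
        self.chunk_index = chunk_index

    def __getitem__(self, key):
        offset, length = self.chunk_index[key]  # KeyError -> treated as a missing chunk
        self.fileobj.seek(offset)
        return self.fileobj.read(length)

    def __iter__(self):
        return iter(self.chunk_index)

    def __len__(self):
        return len(self.chunk_index)

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only store")

    def __delitem__(self, key):
        raise NotImplementedError("read-only store")
```

In practice the array metadata (.zarray, .zattrs) would still come from the consolidated .zmetadata; a store like this is only consulted for the chunk keys themselves.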
Feel free to try running the notebook yourself. (If you run into a "stream is closed" error while computing the max of the Zarr data, just run the cell again; I'm still trying to figure out why that error occurs sometimes.)
There is also this now, which could help: https://github.com/fsspec/kerchunk
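kerchunk packages essentially the same idea: SingleHdf5ToZarr scans an HDF5 file once, records each chunk's offset and length as a set of references, and fsspec's "reference" filesystem then lets zarr/xarray read through those references. A rough sketch of that workflow (the bucket, file name, and parameter choices below are made up for illustration):

```python
# Sketch of the kerchunk workflow; URL and variable names are hypothetical.
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/model_output.h5"  # hypothetical HDF5 file

# Scan the HDF5 file once and record every chunk's (offset, length) as references.
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url, inline_threshold=100).translate()

# Read through the references: zarr sees a normal store, but the chunk bytes
# come from ranged GETs against the original HDF5 file.
fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3",
                       remote_options={"anon": True})
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                     backend_kwargs={"consolidated": False})
```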
I just read the notebook @ajelenak linked to. It makes things clearer. When the Python file-like object from fsspec is passed to h5py.File, it doesn't read the entire file; it knows to parse only the specific byte ranges it needs to get all the metadata. Even though it makes a ton of requests, it won't download the entire file, which is what I was worried about. So in theory you don't need to make the .zmetadata file at all; you could generate that information on the fly from an h5py.File object. But for the best performance and the fewest HTTP requests (as @ajelenak pointed out), a .zmetadata file should be created before processing. Correct me if I'm wrong.
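For anyone who wants to see that "on the fly" behaviour, here is a minimal sketch (the URL and dataset name are hypothetical): h5py accepts the fsspec file-like object and only fetches the byte ranges it needs to answer these queries, at the cost of one request per read.

```python
# Sketch of opening a remote HDF5 file without downloading it; names are hypothetical.
import fsspec
import h5py

with fsspec.open("s3://example-bucket/model_output.h5", "rb", anon=True) as f:
    h5 = h5py.File(f, "r")
    dset = h5["water_level"]
    # Only the metadata blocks needed to answer these questions are fetched,
    # not the whole file -- but each seek/read can become its own HTTP request.
    print(dset.shape, dset.dtype, dset.chunks)
```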