Using the Zarr library to read HDF5
The USGS contracted the HDFGroup to do a test:
Could we make the HDF5 format as performant on the cloud as the Zarr format by writing the HDF5 chunk byte locations into .zmetadata, and then having the Zarr library read those chunks directly from the HDF5 file instead of Zarr-format chunks?
From our first test the answer appears to be YES: https://gist.github.com/rsignell-usgs/3cbe15670bc2be05980dec7c5947b540
We modified both the zarr and xarray libraries to make that notebook possible, adding the FileChunkStore concept. The modified libraries are pinned here: https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/zarr-hdf5/binder/environment.yml#L20-L21
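The core idea behind the FileChunkStore is that a Zarr store only has to answer chunk-key lookups with bytes, and those bytes can just as well come from byte ranges inside the original HDF5 file. Below is a conceptual sketch of that idea; the class name, chunk-index format, and read-only behaviour are assumptions for illustration, not the API of the modified zarr fork:

```python
# Conceptual sketch only -- not the actual FileChunkStore from the modified zarr.
# A Zarr-style store that serves chunk keys by reading the corresponding
# byte ranges straight out of the original HDF5 file.
from collections.abc import MutableMapping


class ByteRangeChunkStore(MutableMapping):
    """Serve Zarr chunk keys from (offset, length) byte ranges in another file."""

    def __init__(self, fileobj, chunk_index):
        # fileobj: an open, seekable file-like object (e.g. from fsspec)
        # chunk_index: dict mapping chunk keys like "water_level/0.0" -> (offset, length),
        # i.e. the kind of information recorded alongside .zmetadata
        self.fileobj = fileobj
        self.chunk_index = chunk_index

    def __getitem__(self, key):
        offset, length = self.chunk_index[key]  # KeyError -> treated as a missing chunk
        self.fileobj.seek(offset)
        return self.fileobj.read(length)

    def __iter__(self):
        return iter(self.chunk_index)

    def __len__(self):
        return len(self.chunk_index)

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only store")

    def __delitem__(self, key):
        raise NotImplementedError("read-only store")
```

In practice the array metadata (.zarray, .zattrs) would still come from the consolidated .zmetadata; a store like this is only consulted for the chunk keys themselves.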
Feel free to try running the notebook yourself. (If you run into a "stream is closed" error while computing the max of the Zarr data, just run the cell again; I'm still trying to figure out why that error occurs sometimes.)
There is also this now, which could help: https://github.com/fsspec/kerchunk
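kerchunk packages essentially the same idea: SingleHdf5ToZarr scans an HDF5 file once, records each chunk's offset and length as a set of references, and fsspec's "reference" filesystem then lets zarr/xarray read through those references. A rough sketch of that workflow (the bucket, file name, and parameter choices below are made up for illustration):

```python
# Sketch of the kerchunk workflow; URL and variable names are hypothetical.
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/model_output.h5"  # hypothetical HDF5 file

# Scan the HDF5 file once and record every chunk's (offset, length) as references.
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url, inline_threshold=100).translate()

# Read through the references: zarr sees a normal store, but the chunk bytes
# come from ranged GETs against the original HDF5 file.
fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3",
                       remote_options={"anon": True})
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                     backend_kwargs={"consolidated": False})
```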
I just read the notebook @ajelenak linked to. It makes things clearer. When the Python file-like object from fsspec is passed to h5py.File, it doesn't read the entire file; it knows to parse only the specific byte ranges it needs to get all the metadata. Even though it makes a ton of requests, it won't download the entire file, which is what I was worried about. So in theory you don't need to make the .zmetadata file at all; you could generate that information on the fly from an h5py.File object. But for the best performance and the fewest HTTP requests (as @ajelenak pointed out), a .zmetadata file should be created before processing. Correct me if I'm wrong.
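For anyone who wants to see that "on the fly" behaviour, here is a minimal sketch (the URL and dataset name are hypothetical): h5py accepts the fsspec file-like object and only fetches the byte ranges it needs to answer these queries, at the cost of one request per read.

```python
# Sketch of opening a remote HDF5 file without downloading it; names are hypothetical.
import fsspec
import h5py

with fsspec.open("s3://example-bucket/model_output.h5", "rb", anon=True) as f:
    h5 = h5py.File(f, "r")
    dset = h5["water_level"]
    # Only the metadata blocks needed to answer these questions are fetched,
    # not the whole file -- but each seek/read can become its own HTTP request.
    print(dset.shape, dset.dtype, dset.chunks)
```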