Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Using the Zarr library to read HDF5

See original GitHub issue

The USGS contracted the HDF Group to run a test: could we make the HDF5 format as performant on the cloud as the Zarr format by writing the HDF5 chunk locations into .zmetadata and then having the Zarr library read from those chunks instead of from Zarr-format chunks?

From our first test the answer appears to be YES: https://gist.github.com/rsignell-usgs/3cbe15670bc2be05980dec7c5947b540

We modified both the zarr and xarray libraries to make that notebook possible, adding the FileChunkStore concept. The modified libraries are: https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/zarr-hdf5/binder/environment.yml#L20-L21
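For readers who want a feel for the FileChunkStore idea without installing the forked libraries, here is a minimal, hypothetical sketch: a store that serves Zarr chunk keys by reading byte ranges from the original HDF5 file. The class, its attributes, and the chunk_index mapping are illustrative assumptions, not the API of the modified zarr library.

```python
# Hypothetical sketch of the idea behind FileChunkStore: instead of holding
# chunk bytes itself, the store looks up each chunk's (offset, size) in the
# original HDF5 file and reads that byte range on demand.
from collections.abc import MutableMapping

import fsspec


class ByteRangeChunkStore(MutableMapping):
    """Serve Zarr chunk keys from byte ranges inside an existing HDF5 file."""

    def __init__(self, url, chunk_index, meta_store):
        # chunk_index maps Zarr chunk keys (e.g. "zeta/0.0.0") to (offset, nbytes)
        # pairs pointing into the HDF5 file; metadata keys (.zarray, .zattrs, ...)
        # are served from a regular dict-like meta_store.
        self._f = fsspec.open(url, mode="rb").open()
        self._chunks = chunk_index
        self._meta = meta_store

    def __getitem__(self, key):
        if key in self._chunks:
            offset, nbytes = self._chunks[key]
            self._f.seek(offset)
            return self._f.read(nbytes)
        return self._meta[key]

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only store")

    def __delitem__(self, key):
        raise NotImplementedError("read-only store")

    def __iter__(self):
        yield from self._chunks
        yield from self._meta

    def __len__(self):
        return len(self._chunks) + len(self._meta)
```

Handing a mapping like this to zarr.open_group as its store would then serve array data straight from the HDF5 chunks, which is roughly what the modified libraries wire up through .zmetadata.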

Feel free to try running the notebook yourself: Binder. (If you run into a 'stream is closed' error when computing the max of the zarr data, just run the cell again; I'm trying to figure out why that error sometimes occurs.)

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 9
  • Comments: 21 (6 by maintainers)

Top GitHub Comments

1 reaction
satra commented, Mar 28, 2022

there is also this now, which could help: https://github.com/fsspec/kerchunk
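For anyone landing here now, kerchunk packages the same approach as a maintained library: it scans an HDF5/NetCDF4 file once, records every chunk's byte offset and length as Zarr-style references, and fsspec then serves those references as a Zarr store. A hedged sketch; the S3 URL and the anonymous-access options are placeholders:

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/data.nc"  # placeholder HDF5/NetCDF4 object

# Scan the HDF5 file once, recording each chunk's byte offset and length
# alongside Zarr-style metadata.
with fsspec.open(url, mode="rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Open the original file through Zarr/xarray using those references;
# only the chunks that are actually requested get fetched.
fs = fsspec.filesystem(
    "reference", fo=refs, remote_protocol="s3", remote_options={"anon": True}
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
```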

1 reaction
djhoese commented, Feb 10, 2020

"you just need to extract the metadata from the existing GOES NetCDF4 files"

I just read the notebook @ajelenak linked to, and it makes this clearer. When the Python file-like object from fsspec is passed to h5py.File, it doesn't read the entire file; it knows to parse only the specific byte ranges needed to get all the metadata. Even though it makes a ton of requests, it won't download the entire file, which is what I was worried about. So in theory you don't need to make the .zmetadata file, since you could generate that information on the fly from an h5py.File object, but for the best performance and the fewest HTTP requests (as @ajelenak pointed out) a .zmetadata file should be created before processing. Correct me if I'm wrong.
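As a rough illustration of that on-the-fly path (the URL and the dataset name below are placeholders), h5py accepts an fsspec file-like object and can expose chunk byte locations through its low-level API:

```python
import fsspec
import h5py

url = "s3://example-bucket/goes-data.nc"  # placeholder NetCDF4/HDF5 object

# h5py seeks to the byte ranges that hold the file metadata rather than
# downloading the whole object, at the cost of many small requests.
with fsspec.open(url, mode="rb", anon=True) as f:
    with h5py.File(f, mode="r") as h5:
        dset = h5["Rad"]  # placeholder dataset name
        print(dset.shape, dset.chunks, dset.dtype)
        # Chunk byte locations can be listed on the fly via the low-level API;
        # this is the information a precomputed .zmetadata would cache.
        for i in range(dset.id.get_num_chunks()):
            info = dset.id.get_chunk_info(i)
            print(info.chunk_offset, info.byte_offset, info.size)
```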

Read more comments on GitHub >

Top Results From Across the Web

Cloud-Performant NetCDF4/HDF5 Reading with the Zarr Library
The Zarr library is used to access multiple chunks of Zarr data in parallel. But what if Zarr were used to access multiple...
Read more >
Tutorial — zarr 2.13.3 documentation - Read the Docs
Zarr provides classes and functions for working with N-dimensional arrays that behave like NumPy arrays but whose data is divided into chunks and...
Read more >
Cloud-performant reading of NetCDF4/HDF5/Grib2 using the ...
Cloud-performant reading of NetCDF4/HDF5/Grib2 using the Zarr library ... Abstract. Many organizations are moving their data to cloud-hosted object storage, which ...
Read more >
To HDF5 and beyond - Alistair Miles
Note that although I've used the LZ4 compression library with Bcolz and Zarr, the compression ratio is actually better than when using gzip...
Read more >
A Comparison of HDF5, Zarr, and netCDF4 in Performing ...
read data is to memory map the file with NumPy, bypassing the HDF5 Python API (h5py). ... netCDF4's use of the HDF5 library...
Read more >
