
Using kerchunk to reference large sets of netcdf4 files

See original GitHub issue

Firstly, thanks for this great tool!

I’m trying to generate a kerchunk reference dataset for many chunked netcdf4 files that comprise a single climate model experiment. I’m doing this on a local file system, not in a cloud environment, similar to https://github.com/fsspec/kerchunk/issues/123.

The climate model experiment in question comprises 2TB across 61 netcdf files (unfortunately I can’t share these data). I generate a single reference json using the approach provided in the tutorial (code below). This all works well, and I can open my combined dataset using xarray.open_dataset and see that it has the correct structure and chunking.

However, when I try to perform a small compute on a variable in this dataset using a dask distributed cluster (with 4GB per worker) I immediately run out of memory. My reference json is ~1GB. Is this being loaded by each worker? I am confused because there are examples in the docs of this approach being applied to 80TB datasets. However, based on my simple example, I would’ve thought that the reference json(s) for an 80TB dataset would be prohibitively large. Am I doing something wrong/misunderstanding? Any advice would be much appreciated.

import fsspec
import ujson
import dask
import xarray as xr
from distributed import Client
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr


client = Client(n_workers=4)
print(f"Dask dashboard at: {client.dashboard_link}")

# Write single file jsons in parallel

fs = fsspec.filesystem('file')
flist = fs.glob("experiment1/ncfile*.nc")

@dask.delayed
def gen_json(file):
    # Translate one netcdf4/HDF5 file into a kerchunk reference dict and write it as JSON
    with fs.open(file) as infile:
        h5chunks = SingleHdf5ToZarr(infile, file)
        outf = f"{file.split('/')[-1]}.json"
        with open(outf, 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())

dask.compute(*[gen_json(file) for file in flist])

# Combine into multifile json

json_list = fs.glob("./ncfile*.nc.json")

mzz = MultiZarrToZarr(
    json_list,
    concat_dims=['time'],
)

d = mzz.translate("experiment1.json")

# Open the reference dataset

m = fsspec.get_mapper(
    'reference://', 
    fo="experiment1.json", 
    remote_protocol="file"
)
ds = xr.open_dataset(
    m,
    engine='zarr', 
    backend_kwargs={"consolidated": False},
    chunks={},
    decode_times=False
)

# Performing a simple compute task on a single variable of ds uses large amounts of memory

ds["variable"].mean(["lon", "lat"]).compute()

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

martindurant commented, Nov 10, 2022

OK, so either way we will need an alternative mechanism to list at least some things (top level dirs for v2, meta files for v3), since we absolutely must avoid having to do string manipulation across all references.
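
To make concrete what that string manipulation would look like (a sketch with hypothetical keys and values, not code from the thread): with a flat reference mapping of the kind kerchunk produces, working out even the top-level directories means splitting every key, so the cost grows with the total number of chunk references.

# Hypothetical flat reference mapping in kerchunk's style
refs = {
    ".zgroup": "...",
    "temp/.zarray": "...",
    "temp/0.0.0": ["experiment1/ncfile1.nc", 0, 1048576],
    "precip/.zarray": "...",
    "precip/0.0.0": ["experiment1/ncfile1.nc", 1048576, 1048576],
}

# Listing the top-level directories touches every key -- this is the
# per-reference string manipulation the comment wants to avoid.
top_level = {key.split("/", 1)[0] for key in refs if "/" in key}
print(top_level)  # {'temp', 'precip'} (set order may vary)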

jhamman commented, Nov 9, 2022

Zarr v3’s spec does not (yet) include consolidated metadata. This extension needs to be written by someone (https://github.com/zarr-developers/zarr-specs/issues/136).

In the v2 spec, consolidated metadata only consolidates the metadata keys. However, in v3, you might imagine a use case where a Zarr storage transformer is used to consolidate the listing of the chunk keys. This sounds a lot like the Kerchunk reference spec and could also be thought of as a general chunk manifest (https://github.com/zarr-developers/zarr-specs/issues/82).
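
For context, a minimal sketch of what v2 consolidation does (assuming zarr-python 2.x and a throwaway local store named example.zarr; none of this comes from the thread): consolidation copies the .zgroup/.zarray/.zattrs documents into a single .zmetadata key, while the chunk keys themselves are never listed or consolidated.

import json
import zarr

# Build a tiny v2 store and consolidate its metadata
store = zarr.DirectoryStore("example.zarr")
root = zarr.group(store=store, overwrite=True)
root.create_dataset("temp", shape=(10, 10), chunks=(5, 5), dtype="f4")
zarr.consolidate_metadata(store)

# .zmetadata contains only metadata documents (.zgroup, .zarray, .zattrs);
# chunk keys such as temp/0.0 do not appear here.
meta = json.loads(store[".zmetadata"])
print(list(meta["metadata"].keys()))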


"In v3, do you know if/when a directory listing would be needed?"

To answer your question specifically: in the xarray context, open_zarr() needs to list all metadata keys to determine what variables exist in the group. This is the only listing that should be required (though I think the current v3 implementation includes a few unnecessary listings, which we're working to fix).
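
To illustrate that listing step (a sketch with hypothetical keys, assuming a v2-style flat store such as a kerchunk reference mapping; not code from the thread): the variables in a group are the keys that carry array metadata, so only the .zarray entries need to be found and the chunk keys never have to be enumerated.

# Hypothetical flat store contents
store_keys = [
    ".zgroup",
    "temp/.zarray", "temp/.zattrs", "temp/0.0.0",
    "precip/.zarray", "precip/.zattrs", "precip/0.0.0",
]

# Discover the variables from metadata keys alone
variables = [key.rsplit("/", 1)[0] for key in store_keys if key.endswith("/.zarray")]
print(variables)  # ['temp', 'precip']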


