Using kerchunk to reference large sets of netcdf4 files
Firstly, thanks for this great tool!
I’m trying to generate a kerchunk reference dataset for many chunked netcdf4 files that comprise a single climate model experiment. I’m doing this on a local file system, not in a cloud environment, similar to https://github.com/fsspec/kerchunk/issues/123.
The climate model experiment in question comprises 2TB across 61 netcdf files (unfortunately I can't share these data). I generate a single reference json using the approach provided in the tutorial (code below). This all works well, and I can open my combined dataset using xarray.open_dataset and see that it has the correct structure and chunking.
However, when I try to perform a small computation on a variable in this dataset using a dask distributed cluster (with 4GB per worker), I immediately run out of memory. My reference json is ~1GB. Is it being loaded by each worker? I am confused because the docs include examples of this approach being applied to 80TB datasets; based on my simple example, I would've thought the reference json(s) for an 80TB dataset would be prohibitively large. Am I doing something wrong or misunderstanding? Any advice would be much appreciated.
import fsspec
import ujson
import dask
import xarray as xr
from distributed import Client
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
client = Client(n_workers=4)
print(f"Dask dashboard at: {client.dashboard_link}")
# Write single file jsons in parallel
fs = fsspec.filesystem('file')
flist = fs.glob("experiment1/ncfile*.nc")
@dask.delayed
def gen_json(file):
    # Translate one netcdf4/HDF5 file into a single-file kerchunk reference json
    with fs.open(file) as infile:
        h5chunks = SingleHdf5ToZarr(infile, file)
        outf = f"{file.split('/')[-1]}.json"
        with open(outf, 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())

dask.compute(*[gen_json(file) for file in flist])
# Combine into multifile json
json_list = fs.glob("./ncfile*.nc.json")
mzz = MultiZarrToZarr(
    json_list,
    concat_dims=['time'],
)
d = mzz.translate("experiment1.json")
# Open the reference dataset
m = fsspec.get_mapper(
    'reference://',
    fo="experiment1.json",
    remote_protocol="file",
)
ds = xr.open_dataset(
    m,
    engine='zarr',
    backend_kwargs={"consolidated": False},
    chunks={},
    decode_times=False,
)
# Performing a simple compute task on a single variable of ds uses large amounts of memory
ds["variable"].mean(["lon", "lat"]).compute()
Top GitHub Comments
OK, so either way we will need an alternative mechanism to list at least some things (top level dirs for v2, meta files for v3), since we absolutely must avoid having to do string manipulation across all references.
Zarr v3’s spec does not (yet) include consolidated metadata. This extension needs to be written by someone (https://github.com/zarr-developers/zarr-specs/issues/136).
In the v2 spec, consolidated metadata only consolidates the metadata keys. However, in v3, you might imagine a use case where a Zarr storage transformer is used to consolidate the listing of the chunk keys. This sounds a lot like the kerchunk reference spec and could also be thought of as a general chunk manifest (https://github.com/zarr-developers/zarr-specs/issues/82).
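For context, kerchunk's version-1 reference format is already essentially such a manifest: metadata keys are inlined as json strings, while each chunk key maps to a (url, offset, length) triple into the original file. A hypothetical minimal example (file name, offsets, and lengths invented):

# Hypothetical minimal kerchunk (version 1) reference set. Metadata keys hold
# inlined json strings; chunk keys point at byte ranges in the source files.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": "{\"zarr_format\": 2}",
        "temp/.zarray": "{...array metadata...}",
        "temp/0.0.0": ["experiment1/ncfile001.nc", 20480, 1048576],
    },
}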
To answer your question specifically: in the xarray context, open_zarr() needs to list all metadata keys to determine what variables exist in the group. This is the only listing that should be required (though I think the current v3 implementation includes a few unnecessary listings; we're working to fix this).
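A sketch of what v2 consolidation avoids, using zarr's public API (the store name here is illustrative, and the store must be writable):

# Sketch: zarr v2 consolidation copies every .zgroup/.zarray/.zattrs entry into
# a single ".zmetadata" key, so xarray can discover all variables with one read
# instead of listing the store.
import zarr
zarr.consolidate_metadata(store)  # "store" is any writable zarr v2 store
ds = xr.open_dataset(store, engine="zarr", backend_kwargs={"consolidated": True})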