Using kerchunk to reference large sets of netcdf4 files
Firstly, thanks for this great tool!
I’m trying to generate a kerchunk reference dataset for many chunked netcdf4 files that comprise a single climate model experiment. I’m doing this on a local file system, not in a cloud environment, similar to https://github.com/fsspec/kerchunk/issues/123.
The climate model experiment in question comprises 2TB across 61 netcdf files (unfortunately I can't share these data). I generate a single reference json using the approach provided in the tutorial (code below). This all works well, and I can open my combined dataset using xarray.open_dataset and see that it has the correct structure and chunking.
However, when I try to perform a small computation on a variable in this dataset using a dask distributed cluster (with 4GB per worker), I immediately run out of memory. My reference json is ~1GB. Is it being loaded by each worker? I am confused because the docs include examples of this approach being applied to 80TB datasets; based on my simple example, I would've thought the reference json(s) for an 80TB dataset would be prohibitively large. Am I doing something wrong or misunderstanding? Any advice would be much appreciated.
import fsspec
import ujson
import dask
import xarray as xr
from distributed import Client
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
client = Client(n_workers=4)
print(f"Dask dashboard at: {client.dashboard_link}")
# Write single file jsons in parallel
fs = fsspec.filesystem('file')
flist = fs.glob("experiment1/ncfile*.nc")
@dask.delayed
def gen_json(file):
    # Translate one netcdf4/HDF5 file into a single-file kerchunk reference json
    with fs.open(file) as infile:
        h5chunks = SingleHdf5ToZarr(infile, file)
        outf = f"{file.split('/')[-1]}.json"
        with open(outf, 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())

dask.compute(*[gen_json(file) for file in flist])
# Combine into multifile json
json_list = fs.glob("./ncfile*.nc.json")
mzz = MultiZarrToZarr(
    json_list,
    concat_dims=['time'],
)
d = mzz.translate("experiment1.json")
# Open the reference dataset
m = fsspec.get_mapper(
    'reference://',
    fo="experiment1.json",
    remote_protocol="file",
)
ds = xr.open_dataset(
    m,
    engine='zarr',
    backend_kwargs={"consolidated": False},
    chunks={},
    decode_times=False,
)
# Performing a simple compute task on a single variable of ds uses large amounts of memory
ds["variable"].mean(["lon", "lat"]).compute()
Top GitHub Comments
OK, so either way we will need an alternative mechanism to list at least some things (top level dirs for v2, meta files for v3), since we absolutely must avoid having to do string manipulation across all references.
Zarr v3’s spec does not (yet) include consolidated metadata. This extension needs to be written by someone (https://github.com/zarr-developers/zarr-specs/issues/136).
In the v2 spec, consolidated metadata only consolidates the metadata keys. However, in v3, you might imagine a use case where a Zarr storage transformer is used to consolidate the listing of the chunk keys. This sounds a lot like the kerchunk reference spec and could also be thought of as a general chunk manifest (https://github.com/zarr-developers/zarr-specs/issues/82).
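For context, kerchunk's version-1 reference format is already essentially such a manifest: metadata keys are inlined as json strings, while each chunk key maps to a (url, offset, length) triple into the original file. A hypothetical minimal example (file name, offsets, and lengths invented):

# Hypothetical minimal kerchunk (version 1) reference set. Metadata keys hold
# inlined json strings; chunk keys point at byte ranges in the source files.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": "{\"zarr_format\": 2}",
        "temp/.zarray": "{...array metadata...}",
        "temp/0.0.0": ["experiment1/ncfile001.nc", 20480, 1048576],
    },
}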
To answer your question specifically: in the xarray context, open_zarr() needs to list all metadata keys to determine what variables exist in the group. This is the only listing that should be required (though I think the current v3 implementation includes a few unnecessary listings; we're working to fix this).
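A sketch of what v2 consolidation avoids, using zarr's public API (the store name here is illustrative, and the store must be writable):

# Sketch: zarr v2 consolidation copies every .zgroup/.zarray/.zattrs entry into
# a single ".zmetadata" key, so xarray can discover all variables with one read
# instead of listing the store.
import zarr
zarr.consolidate_metadata(store)  # "store" is any writable zarr v2 store
ds = xr.open_dataset(store, engine="zarr", backend_kwargs={"consolidated": True})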