
Apply kerchunk to one-timestep-per-file (climate model) output

See original GitHub issue
  • I want to use kerchunk to open climate model netCDF output, stored 1 or 12 timesteps per file across many files, as one huge zarr store.
  • I do not care about the cloud capabilities; I just need this to work locally.
  • I do not want to convert the nc files to zarr, because I don't have the storage to keep both and I need to keep the nc files.
  • Therefore I thought kerchunk could help me.

I wrote a small example below, but it fails with a zarr shape error that I do not understand. SingleHdf5ToZarr runs fine, but MultiZarrToZarr fails. Does anyone see a fix for this? Is this example supposed to work?

import os
import zipfile
import kerchunk.hdf
import fsspec
import json
import xarray as xr
import numpy as np

# dummy data
ds = xr.DataArray(name='var', data=np.random.random((100,100)), dims=['lon','lat'], coords={'lon':range(100), 'lat':range(100)}).to_dataset()#.expand_dims('time')
urls =[f'test_{y}.nc' for y in range(2000,2011)]
for u in urls:
    ds.to_netcdf(u)

so = dict(
    anon=True, default_fill_cache=False, default_cache_type='first'
)
with zipfile.ZipFile("out.zip", mode="w") as zf:
    for u in urls:
        print(u)
        print(os.path.exists(u))
        with fsspec.open(u, **so) as inf:
            h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)#, mode='r')
            with zf.open(os.path.basename(u) + ".json", 'w') as outf:
                outf.write(json.dumps(h5chunks.translate()).encode())

from kerchunk.combine import MultiZarrToZarr
mzz = MultiZarrToZarr(
    "zip://*.json::out.zip",
    remote_protocol="file",
    xarray_open_kwargs={
       # "preprocess": None,#drop_coords,
        "decode_cf": False,
        "mask_and_scale": False,
        "decode_times": False,
        "decode_timedelta": False,
        "use_cftime": False,
        "decode_coords": False
    },
    xarray_concat_args={'dim':'time'}
)
#test_dict = mzz.translate()

# open single json as zarr
test_dict='test_2000.nc.json' # after unzip out.zip
m = fsspec.get_mapper('reference://', fo=test_dict, remote_protocol='file')
print(xr.open_dataset(m, engine='zarr')) # works


mzz.translate("output.zarr") # see Traceback below

# This can also be written as a json
mzz.translate("output.json") # same Traceback

Output

(xr) aaron.spring@MacBook-Pro:~/Coding/kerchunk/my_kerchunk$ python run.py
test_2000.nc
True
test_2001.nc
True
test_2002.nc
True
test_2003.nc
True
test_2004.nc
True
test_2005.nc
True
test_2006.nc
True
test_2007.nc
True
test_2008.nc
True
test_2009.nc
True
test_2010.nc
True
/Users/aaron.spring/Coding/kerchunk/my_kerchunk/run.py:49: RuntimeWarning: Failed to open Zarr store with consolidated metadata, falling back to try reading non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  print(xr.open_dataset(m, engine='zarr'))
<xarray.Dataset>
Dimensions:  (lat: 100, lon: 100)
Coordinates:
  * lat      (lat) float64 0.0 1.0 2.0 3.0 4.0 5.0 ... 95.0 96.0 97.0 98.0 99.0
  * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 5.0 ... 95.0 96.0 97.0 98.0 99.0
Data variables:
    var      (lon, lat) float64 ...
Traceback (most recent call last):
  File "/Users/aaron.spring/Coding/kerchunk/my_kerchunk/run.py", line 52, in <module>
    mzz.translate("output.zarr")
  File "/Users/aaron.spring/Coding/kerchunk/kerchunk/combine.py", line 79, in translate
    out = self._build_output(ds, ds0, fss)
  File "/Users/aaron.spring/Coding/kerchunk/kerchunk/combine.py", line 185, in _build_output
    acc_len = make_coord(fss, z, accum_dim)
  File "/Users/aaron.spring/Coding/kerchunk/kerchunk/combine.py", line 301, in make_coord
    zz = zarr.open_array(fs.get_mapper(accum_dim))
  File "/Users/aaron.spring/mambaforge/envs/xr/lib/python3.9/site-packages/zarr/creation.py", line 526, in open_array
    init_array(store, shape=shape, chunks=chunks, dtype=dtype,
  File "/Users/aaron.spring/mambaforge/envs/xr/lib/python3.9/site-packages/zarr/storage.py", line 353, in init_array
    _init_array_metadata(store, shape=shape, chunks=chunks, dtype=dtype,
  File "/Users/aaron.spring/mambaforge/envs/xr/lib/python3.9/site-packages/zarr/storage.py", line 390, in _init_array_metadata
    shape = normalize_shape(shape) + dtype.shape
  File "/Users/aaron.spring/mambaforge/envs/xr/lib/python3.9/site-packages/zarr/util.py", line 42, in normalize_shape
    raise TypeError('shape is None')
TypeError: shape is None

Versions:

  • zarr 2.10.3
  • kerchunk 0.0.5+47.gf42b0c2 (current main)
  • fsspec 2022.1.0
  • xarray 0.20.2

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

2 reactions
martindurant commented, Feb 4, 2022

.expand_dims('time')

This just adds “time” to the list of dims but doesn’t assign any values, so there’s no way to determine what the output “time” coordinate should be. Incidentally, in #122 I am developing a new version of combine, in which you can define how to extract the concat coordinate(s) from each dataset. The following works, for example:

In [20]: from kerchunk.combine import MultiZarrToZarr
    ...: mzz = MultiZarrToZarr(
    ...:     "zip://*.json::out.zip",
    ...:     remote_protocol="file",
    ...:     concat_dims=["time"],
    ...:     coo_map={"time": "INDEX"}
    ...: )
In [21]: test_dict = mzz.translate()
In [22]: m = fsspec.get_mapper('reference://', fo=test_dict, remote_protocol='file')
In [23]: xr.open_dataset(m, engine='zarr', backend_kwargs={"consolidated": False})
Out[23]:
<xarray.Dataset>
Dimensions:  (lat: 100, lon: 100, time: 11)
Coordinates:
  * lat      (lat) float64 0.0 1.0 2.0 3.0 4.0 5.0 ... 95.0 96.0 97.0 98.0 99.0
  * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 5.0 ... 95.0 96.0 97.0 98.0 99.0
  * time     (time) float64 nan 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0
Data variables:
    var      (time, lon, lat) float64 ...

Here the coo_map argument defines the value of “time” to be the sequential index of each input. The output dtype of “time” should be int64 rather than float64, but this is pretty good.
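A complementary fix on the data side (not from the thread, just a sketch using the question’s dummy-data setup): `expand_dims("time")` with only a name adds a size-1 dimension without values, whereas passing a value attaches a real one-element coordinate that a later concatenation along “time” can actually use.

```python
import numpy as np
import xarray as xr

# Same dummy data as in the question
ds = xr.DataArray(
    name="var",
    data=np.random.random((100, 100)),
    dims=["lon", "lat"],
    coords={"lon": range(100), "lat": range(100)},
).to_dataset()

# expand_dims("time") alone adds a dimension with no coordinate values;
# expand_dims(time=[2000]) attaches a real one-element "time" coordinate
fixed = ds.expand_dims(time=[2000])
print(fixed.time.values)   # [2000]
print(fixed["var"].dims)   # ('time', 'lon', 'lat')
```

Writing each year as `ds.expand_dims(time=[y]).to_netcdf(f'test_{y}.nc')` in the original loop would then give every file a concrete time value, instead of relying on the combiner to invent one.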

0 reactions
martindurant commented, Feb 28, 2022

On second thought, since the data is uncompressed, it may well be that h5py can read it directly without a copy and without the GIL, whereas kerchunk/ReferenceFileSystem deals with bytes. We could implement opening as mmap and passing around memory buffers.
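The zero-copy idea in this comment can be sketched in plain Python/NumPy. This is illustrative only: it uses a synthetic binary file standing in for an uncompressed chunk, not kerchunk internals.

```python
import mmap
import numpy as np

# Write a small uncompressed binary file to stand in for a raw chunk
data = np.arange(12, dtype="float64")
with open("raw.bin", "wb") as f:
    f.write(data.tobytes())

# Map the file and view a byte range as an array without copying:
# np.frombuffer wraps the mapped memory directly, so no bytes move
f = open("raw.bin", "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
arr = np.frombuffer(memoryview(mm)[:96], dtype="float64")  # 12 * 8 bytes
print(arr)
```

The resulting array is read-only and remains valid only while the mapping is open, which is exactly the kind of lifetime management a memory-buffer approach inside kerchunk would have to handle.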
