Apply kerchunk to one-timestep-per-file (climate model) output
I want to use `kerchunk` to open climate model netCDF output, stored as 1 or 12 timesteps per file across many files, as one huge virtual zarr store.

- I do not care about the cloud capabilities; I just need to get this working locally.
- I do not want to convert the `nc` files to zarr, because I don't have the storage to keep both and need to keep the `nc` files.
- Therefore I thought `kerchunk` could help me.

I wrote a small example below, but it fails with a zarr shape error that I do not understand. `SingleHdf5ToZarr` runs, but `MultiZarrToZarr` fails. Does anyone see a fix for this? Is this example supposed to work?
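For context on what the script below produces: each JSON that `SingleHdf5ToZarr.translate()` emits follows fsspec's ReferenceFileSystem "version 1" layout, where zarr metadata keys are inlined as JSON strings and each chunk key maps to `[url, offset, length]` into the original file. A minimal hand-built sketch (the file name, offsets, and lengths here are invented for illustration):

```python
import json

# Hand-built reference set in the fsspec ReferenceFileSystem "version 1"
# layout (the format SingleHdf5ToZarr.translate() emits).
# File name, offset, and length below are invented for illustration.
refs = {
    "version": 1,
    "refs": {
        # zarr metadata is inlined as JSON strings
        ".zgroup": "{\"zarr_format\": 2}",
        "var/.zarray": json.dumps({
            "shape": [100, 100], "chunks": [100, 100],
            "dtype": "<f8", "compressor": None,
            "fill_value": None, "filters": None,
            "order": "C", "zarr_format": 2,
        }),
        # each chunk key points at [url, offset, length] in the netCDF file
        "var/0.0": ["test_2000.nc", 20480, 80000],
    },
}

blob = json.dumps(refs)  # this is what gets written into out.zip
print(json.loads(blob)["refs"]["var/0.0"])
```

Opening such a JSON via `fsspec.get_mapper('reference://', fo=...)` is what makes the single-file case below work: zarr never touches the netCDF library, it just reads the listed byte ranges.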
```python
import os
import zipfile
import json

import fsspec
import kerchunk.hdf
import numpy as np
import xarray as xr

# dummy data
ds = xr.DataArray(
    name='var',
    data=np.random.random((100, 100)),
    dims=['lon', 'lat'],
    coords={'lon': range(100), 'lat': range(100)},
).to_dataset()  # .expand_dims('time')
urls = [f'test_{y}.nc' for y in range(2000, 2011)]
for u in urls:
    ds.to_netcdf(u)

so = dict(anon=True, default_fill_cache=False, default_cache_type='first')

with zipfile.ZipFile("out.zip", mode="w") as zf:
    for u in urls:
        print(u)
        print(os.path.exists(u))
        with fsspec.open(u, **so) as inf:
            h5chunks = kerchunk.hdf.SingleHdf5ToZarr(inf, u, inline_threshold=100)  # , mode='r'
            with zf.open(os.path.basename(u) + ".json", 'w') as outf:
                outf.write(json.dumps(h5chunks.translate()).encode())
```
```python
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    "zip://*.json::out.zip",
    remote_protocol="file",
    xarray_open_kwargs={
        # "preprocess": None,  # drop_coords,
        "decode_cf": False,
        "mask_and_scale": False,
        "decode_times": False,
        "decode_timedelta": False,
        "use_cftime": False,
        "decode_coords": False,
    },
    xarray_concat_args={'dim': 'time'},
)

# test_dict = mzz.translate()

# open a single json as zarr
test_dict = 'test_2000.nc.json'  # after unzipping out.zip
m = fsspec.get_mapper('reference://', fo=test_dict, remote_protocol='file')
print(xr.open_dataset(m, engine='zarr'))  # works

mzz.translate("output.zarr")  # see Traceback below
# This can also be written as a json:
mzz.translate("output.json")  # same Traceback
```
Output:

```
(xr) aaron.spring@MacBook-Pro:~/Coding/kerchunk/my_kerchunk$ python run.py
test_2000.nc
True
test_2001.nc
True
test_2002.nc
True
test_2003.nc
True
test_2004.nc
True
test_2005.nc
True
test_2006.nc
True
test_2007.nc
True
test_2008.nc
True
test_2009.nc
True
test_2010.nc
True
/Users/aaron.spring/Coding/kerchunk/my_kerchunk/run.py:49: RuntimeWarning: Failed to open Zarr store with consolidated metadata, falling back to try reading non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
  print(xr.open_dataset(m, engine='zarr'))
<xarray.Dataset>
Dimensions:  (lat: 100, lon: 100)
Coordinates:
  * lat      (lat) float64 0.0 1.0 2.0 3.0 4.0 5.0 ... 95.0 96.0 97.0 98.0 99.0
  * lon      (lon) float64 0.0 1.0 2.0 3.0 4.0 5.0 ... 95.0 96.0 97.0 98.0 99.0
Data variables:
    var      (lon, lat) float64 ...
Traceback (most recent call last):
  File "/Users/aaron.spring/Coding/kerchunk/my_kerchunk/run.py", line 52, in <module>
    mzz.translate("output.zarr")
  File "/Users/aaron.spring/Coding/kerchunk/kerchunk/combine.py", line 79, in translate
    out = self._build_output(ds, ds0, fss)
  File "/Users/aaron.spring/Coding/kerchunk/kerchunk/combine.py", line 185, in _build_output
    acc_len = make_coord(fss, z, accum_dim)
  File "/Users/aaron.spring/Coding/kerchunk/kerchunk/combine.py", line 301, in make_coord
    zz = zarr.open_array(fs.get_mapper(accum_dim))
  File "/Users/aaron.spring/mambaforge/envs/xr/lib/python3.9/site-packages/zarr/creation.py", line 526, in open_array
    init_array(store, shape=shape, chunks=chunks, dtype=dtype,
  File "/Users/aaron.spring/mambaforge/envs/xr/lib/python3.9/site-packages/zarr/storage.py", line 353, in init_array
    _init_array_metadata(store, shape=shape, chunks=chunks, dtype=dtype,
  File "/Users/aaron.spring/mambaforge/envs/xr/lib/python3.9/site-packages/zarr/storage.py", line 390, in _init_array_metadata
    shape = normalize_shape(shape) + dtype.shape
  File "/Users/aaron.spring/mambaforge/envs/xr/lib/python3.9/site-packages/zarr/util.py", line 42, in normalize_shape
    raise TypeError('shape is None')
TypeError: shape is None
```
Versions:
- zarr 2.10.3
- kerchunk 0.0.5+47.gf42b0c2 (current main)
- fsspec 2022.1.0
- xarray 0.20.2
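One way around the error is suggested by the commented-out `expand_dims('time')` in the example: that call alone creates a size-1 dimension with no coordinate values, so the concat dimension has nothing to read. A hedged sketch of a data-side fix (my assumption, not code from the thread): write an explicit time value into each yearly file before kerchunking. Only the time values are computed here; the xarray call is shown commented out because it depends on the `ds`/`urls` setup above.

```python
from datetime import datetime

# Sketch of a possible fix (assumes yearly files, as in the example above):
# give each file a real time value so the "time" dimension has data.
years = range(2000, 2011)
times = {f"test_{y}.nc": datetime(y, 1, 1) for y in years}

# With xarray (not executed here; ds and urls come from the snippet above):
# for u in urls:
#     ds.expand_dims(time=[times[u]]).to_netcdf(u)

print(times["test_2000.nc"].isoformat())  # -> 2000-01-01T00:00:00
```

With real time values in every file, the combine step has concrete coordinate data to concatenate instead of a shapeless placeholder.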
Issue Analytics
- State:
- Created 2 years ago
- Comments: 10 (6 by maintainers)
Top GitHub Comments
This just adds "time" to the list of dims but doesn't assign any values, so there is no way to determine what the output "time" coordinate should be. Interestingly, in #122 I am developing a new version of combine, where you can define how to extract the concat coordinate(s) from each dataset. For example, via the `coo_map` argument we can define the value of "time" to be equal to the sequential index of the inputs. The output dtype of "time" should be int64, not float64, but this is pretty good.

On second thought, since the data is uncompressed, it may well be that h5py can read it directly without a copy and without the GIL, whereas kerchunk/ReferenceFileSystem deals with bytes. We could implement opening as mmap and passing around memory buffers.