Variable encoding lost in MultiZarrToZarr
I have been playing around with fsspec-reference-maker while developing a tutorial example for https://github.com/pangeo-forge/pangeo-forge-recipes/pull/174.

I have discovered a problem related to how encoding is handled. When a variable has different encoding in different files, the reference filesystem produced by `MultiZarrToZarr` assumes a single encoding for the whole concatenated array. The raw data on disk are not changed, however, so the wrong encoding gets applied to some of the chunks. This may seem like a niche case, but it is actually very common for timeseries netCDF files to have their time dimension encoded as an int with units `days since X`, where `X` is different in each file.
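The same raw integers decode to different timestamps depending on the `units` attribute, so applying one file's encoding to another file's bytes silently shifts the dates. A minimal illustration using cftime (the calendar library xarray uses to decode the `noleap` calendar):

```python
import cftime

raw = [1, 2, 3]  # identical raw values stored in both files
print(cftime.num2date(raw, units='days since 1900-01-01', calendar='noleap'))
# dates 1900-01-02, 1900-01-03, 1900-01-04
print(cftime.num2date(raw, units='days since 1900-01-04', calendar='noleap'))
# dates 1900-01-05, 1900-01-06, 1900-01-07
```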
The following example is a simple standalone reproducer. The key point is that the two original datasets have the same raw data in the `time` coordinate, but different attributes.
```python
import os
import json

import xarray as xr
import fsspec
from fsspec_reference_maker.combine import MultiZarrToZarr
from fsspec_reference_maker.hdf import SingleHdf5ToZarr

# Two netCDF files with identical raw time values ([1, 2, 3]) but different
# time encoding ('days since 1900-01-01' vs. 'days since 1900-01-04').
ds1 = xr.DataArray(
    [0, 0, 0],
    name='foo',
    dims=['time'],
    coords={
        'time': (
            ['time'],
            [1, 2, 3],
            {'units': 'days since 1900-01-01', 'calendar': 'noleap'}
        )
    }
).to_dataset()
ds1.to_netcdf('ds1.nc', mode='w')

ds2 = xr.DataArray(
    [1, 1, 1],
    name='foo',
    dims=['time'],
    coords={
        'time': (
            ['time'],
            [1, 2, 3],
            {'units': 'days since 1900-01-04', 'calendar': 'noleap'}
        )
    }
).to_dataset()
ds2.to_netcdf('ds2.nc', mode='w')

fnames = ['ds1.nc', 'ds2.nc']

# Reference result: let xarray do the decoding and concatenation itself.
ds_concat = xr.open_mfdataset(fnames).load()

# Generate a single-file reference JSON for each netCDF file.
for fname in fnames:
    json_fname = os.path.basename(fname + ".json")
    url = 'file://' + os.path.abspath(fname)
    with fsspec.open(fname) as f:
        h5chunks = SingleHdf5ToZarr(f, url, inline_threshold=300)
        chunks = h5chunks.translate()
    with open(json_fname, mode='wt') as f_out:
        json.dump(chunks, f_out)

# Open each single-file reference dataset and concatenate with xarray.
dsets = []
for fname in fnames:
    ref_url = 'file://' + os.path.abspath(fname + ".json")
    m = fsspec.get_mapper(
        "reference://",
        fo=ref_url,
        target_protocol="file",
        remote_protocol="file",  # the referenced data are local files
        skip_instance_cache=True,
    )
    dsets.append(xr.open_dataset(m, engine='zarr', backend_kwargs={'consolidated': False}))
ds_concat_ref = xr.concat(dsets, dim='time')

# Combine the two references with MultiZarrToZarr.
json_files = ['file://' + os.path.abspath(fname + ".json") for fname in fnames]
mzz = MultiZarrToZarr(
    json_files,
    xarray_concat_args={'dim': 'time'},
    remote_protocol='file'
)
out = mzz.translate(None)
with open('combined.json', 'wt') as fp:
    json.dump(out, fp)

m_combined = fsspec.get_mapper(
    "reference://",
    fo='combined.json',
    target_protocol="file",
    remote_protocol="file",
    skip_instance_cache=True,
)
ds_combined = xr.open_dataset(m_combined, engine='zarr', backend_kwargs={'consolidated': False})

print(ds_concat.time)
print(ds_concat_ref.time)
print(ds_combined.time)

xr.testing.assert_equal(ds_concat.time, ds_concat_ref.time)
xr.testing.assert_equal(ds_concat.time, ds_combined.time)
```
The first assertion passes; the second one fails:

```
AssertionError: Left and right DataArray objects are not equal

Differing values:
L
    array([cftime.DatetimeNoLeap(1900, 1, 2, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 6, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 7, 0, 0, 0, 0, has_year_zero=True)],
          dtype=object)
R
    array([cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True)],
          dtype=object)
Differing coordinates:
L * time     (time) object 1900-01-02 00:00:00 ... 1900-01-07 00:00:00
R * time     (time) object 1900-01-03 00:00:00 ... 1900-01-05 00:00:00
```
Digging deeper, we can see that each individual `SingleHdf5ToZarr` reference dataset preserves the correct encoding. But the combined dataset has a new encoding (determined by xarray) which doesn't match the actual encoding of the values on disk.
```python
print(dsets[0].time.encoding)
print(dsets[1].time.encoding)
print(ds_combined.time.encoding)
```

```
{'chunks': (3,), 'preferred_chunks': {'time': 3}, 'compressor': None, 'filters': None, '_FillValue': -9223372036854775806, 'units': 'days since 1900-01-01', 'calendar': 'noleap', 'dtype': dtype('int64')}
{'chunks': (3,), 'preferred_chunks': {'time': 3}, 'compressor': None, 'filters': None, '_FillValue': -9223372036854775806, 'units': 'days since 1900-01-04', 'calendar': 'noleap', 'dtype': dtype('int64')}
{'chunks': (6,), 'preferred_chunks': {'time': 6}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': 0, 'units': 'days since 1900-01-02 00:00:00.000000', 'calendar': 'noleap', 'dtype': dtype('int64')}
```
This is a pretty tricky situation. The whole point of the reference filesystem is to avoid copying the data, but there is no way for an xarray dataset to use different encoding for different parts of the same array. So the best we could do in this case would be to raise an error when the encoding is inconsistent. Alternatively, for small data (below `inline_threshold`), we could consider re-encoding the data with a uniform encoding, as xarray does when it writes data to disk. Sketches of both options follow below.
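As a sketch of the first option: compare the on-disk encoding of each input reference dataset and fail loudly instead of silently picking one. The helper below is hypothetical (not part of fsspec-reference-maker); `dsets` is the list of single-file reference datasets opened in the reproducer above.

```python
def check_consistent_encoding(dsets, var, keys=('units', 'calendar', 'dtype', '_FillValue')):
    """Raise if the input datasets disagree on any encoding key for `var`."""
    first = {k: dsets[0][var].encoding.get(k) for k in keys}
    for ds in dsets[1:]:
        for k in keys:
            if ds[var].encoding.get(k) != first[k]:
                raise ValueError(
                    f"inconsistent encoding for {var!r}: "
                    f"{k}={ds[var].encoding.get(k)!r} != {first[k]!r}"
                )

check_consistent_encoding(dsets, 'time')
# ValueError: inconsistent encoding for 'time':
#     units='days since 1900-01-04' != 'days since 1900-01-01'
```

And a sketch of the second option, assuming the values are small enough to inline: decode each piece with its own units, then re-encode everything against a single target encoding, much as xarray does on write.

```python
import numpy as np
import cftime

raw = [np.array([1, 2, 3]), np.array([1, 2, 3])]            # values on disk
units = ['days since 1900-01-01', 'days since 1900-01-04']  # per-file encoding

# decode each file's values with its own units ...
dates = np.concatenate(
    [cftime.num2date(r, units=u, calendar='noleap') for r, u in zip(raw, units)]
)
# ... then re-encode against the first file's units, so one encoding fits all chunks
reencoded = cftime.date2num(dates, units='days since 1900-01-01', calendar='noleap')
print(reencoded)  # [1 2 3 4 5 6]
```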
Top GitHub Comments

- I think so, yes. It certainly appeared to give the right results in the tutorial notebook.
- I think you probably mean https://github.com/pangeo-forge/pangeo-forge-recipes/pull/174?