Variable encoding lost in MultiZarrToZarr
I have been playing around with fsspec-reference-maker while developing a tutorial example for https://github.com/pangeo-forge/pangeo-forge-recipes/pull/174.

I have discovered a problem related to how encoding is handled. When a variable has different encoding in different files, the reference filesystem produced by `MultiZarrToZarr` assumes a single encoding for the whole concatenated array. The raw data on disk are not changed, however, so the wrong encoding gets applied to some of the chunks. This may seem like a niche case, but it is actually very common for timeseries netCDF files to have their time dimension encoded as an int with units `days since X`, where `X` is different in each file.
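The same raw integers decode to different timestamps depending on the `units` attribute, so applying one file's encoding to another file's bytes silently shifts the dates. A minimal illustration using cftime (the calendar library xarray uses to decode the `noleap` calendar):

```python
import cftime

raw = [1, 2, 3]  # identical raw values stored in both files
print(cftime.num2date(raw, units='days since 1900-01-01', calendar='noleap'))
# dates 1900-01-02, 1900-01-03, 1900-01-04
print(cftime.num2date(raw, units='days since 1900-01-04', calendar='noleap'))
# dates 1900-01-05, 1900-01-06, 1900-01-07
```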
The following example is a simple standalone reproducer. The key point is that the two original datasets have the same raw data in the `time` coordinate, but different attributes.
```python
import os
import json

import xarray as xr
import fsspec
from fsspec_reference_maker.combine import MultiZarrToZarr
from fsspec_reference_maker.hdf import SingleHdf5ToZarr

# Two netCDF files with identical raw time values ([1, 2, 3]) but different
# time encoding ('days since 1900-01-01' vs. 'days since 1900-01-04').
ds1 = xr.DataArray(
    [0, 0, 0],
    name='foo',
    dims=['time'],
    coords={
        'time': (
            ['time'],
            [1, 2, 3],
            {'units': 'days since 1900-01-01', 'calendar': 'noleap'}
        )
    }
).to_dataset()
ds1.to_netcdf('ds1.nc', mode='w')

ds2 = xr.DataArray(
    [1, 1, 1],
    name='foo',
    dims=['time'],
    coords={
        'time': (
            ['time'],
            [1, 2, 3],
            {'units': 'days since 1900-01-04', 'calendar': 'noleap'}
        )
    }
).to_dataset()
ds2.to_netcdf('ds2.nc', mode='w')

fnames = ['ds1.nc', 'ds2.nc']

# Reference result: let xarray do the decoding and concatenation itself.
ds_concat = xr.open_mfdataset(fnames).load()

# Generate a single-file reference JSON for each netCDF file.
for fname in fnames:
    json_fname = os.path.basename(fname + ".json")
    url = 'file://' + os.path.abspath(fname)
    with fsspec.open(fname) as f:
        h5chunks = SingleHdf5ToZarr(f, url, inline_threshold=300)
        chunks = h5chunks.translate()
    with open(json_fname, mode='wt') as f_out:
        json.dump(chunks, f_out)

# Open each single-file reference dataset and concatenate with xarray.
dsets = []
for fname in fnames:
    ref_url = 'file://' + os.path.abspath(fname + ".json")
    m = fsspec.get_mapper(
        "reference://",
        fo=ref_url,
        target_protocol="file",
        remote_protocol="file",  # the referenced data are local files
        skip_instance_cache=True,
    )
    dsets.append(xr.open_dataset(m, engine='zarr', backend_kwargs={'consolidated': False}))
ds_concat_ref = xr.concat(dsets, dim='time')

# Combine the two references with MultiZarrToZarr.
json_files = ['file://' + os.path.abspath(fname + ".json") for fname in fnames]
mzz = MultiZarrToZarr(
    json_files,
    xarray_concat_args={'dim': 'time'},
    remote_protocol='file'
)
out = mzz.translate(None)
with open('combined.json', 'wt') as fp:
    json.dump(out, fp)

m_combined = fsspec.get_mapper(
    "reference://",
    fo='combined.json',
    target_protocol="file",
    remote_protocol="file",
    skip_instance_cache=True,
)
ds_combined = xr.open_dataset(m_combined, engine='zarr', backend_kwargs={'consolidated': False})

print(ds_concat.time)
print(ds_concat_ref.time)
print(ds_combined.time)

xr.testing.assert_equal(ds_concat.time, ds_concat_ref.time)
xr.testing.assert_equal(ds_concat.time, ds_combined.time)
```
The first assertion passes; the second one fails:

```
AssertionError: Left and right DataArray objects are not equal

Differing values:
L
    array([cftime.DatetimeNoLeap(1900, 1, 2, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 6, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 7, 0, 0, 0, 0, has_year_zero=True)],
          dtype=object)
R
    array([cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True)],
          dtype=object)
Differing coordinates:
L * time     (time) object 1900-01-02 00:00:00 ... 1900-01-07 00:00:00
R * time     (time) object 1900-01-03 00:00:00 ... 1900-01-05 00:00:00
```
Digging deeper, we can see that each individual `SingleHdf5ToZarr` reference dataset preserves the correct encoding. But the combined dataset has a new encoding (determined by xarray) which doesn't match the actual encoding of the values on disk.
```python
print(dsets[0].time.encoding)
print(dsets[1].time.encoding)
print(ds_combined.time.encoding)
```

```
{'chunks': (3,), 'preferred_chunks': {'time': 3}, 'compressor': None, 'filters': None, '_FillValue': -9223372036854775806, 'units': 'days since 1900-01-01', 'calendar': 'noleap', 'dtype': dtype('int64')}
{'chunks': (3,), 'preferred_chunks': {'time': 3}, 'compressor': None, 'filters': None, '_FillValue': -9223372036854775806, 'units': 'days since 1900-01-04', 'calendar': 'noleap', 'dtype': dtype('int64')}
{'chunks': (6,), 'preferred_chunks': {'time': 6}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': 0, 'units': 'days since 1900-01-02 00:00:00.000000', 'calendar': 'noleap', 'dtype': dtype('int64')}
```
This is a pretty tricky situation. The whole point of the reference filesystem is to avoid copying the data, but there is no way for an xarray dataset to use different encoding for different parts of the same array. So the best we could do in this case would be to raise an error when the encoding is inconsistent. Alternatively, for small data (below `inline_threshold`), we could consider re-encoding the data with a uniform encoding, as xarray does when it writes data to disk. Sketches of both options follow below.
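As a sketch of the first option: compare the on-disk encoding of each input reference dataset and fail loudly instead of silently picking one. The helper below is hypothetical (not part of fsspec-reference-maker); `dsets` is the list of single-file reference datasets opened in the reproducer above.

```python
def check_consistent_encoding(dsets, var, keys=('units', 'calendar', 'dtype', '_FillValue')):
    """Raise if the input datasets disagree on any encoding key for `var`."""
    first = {k: dsets[0][var].encoding.get(k) for k in keys}
    for ds in dsets[1:]:
        for k in keys:
            if ds[var].encoding.get(k) != first[k]:
                raise ValueError(
                    f"inconsistent encoding for {var!r}: "
                    f"{k}={ds[var].encoding.get(k)!r} != {first[k]!r}"
                )

check_consistent_encoding(dsets, 'time')
# ValueError: inconsistent encoding for 'time':
#     units='days since 1900-01-04' != 'days since 1900-01-01'
```

And a sketch of the second option, assuming the values are small enough to inline: decode each piece with its own units, then re-encode everything against a single target encoding, much as xarray does on write.

```python
import numpy as np
import cftime

raw = [np.array([1, 2, 3]), np.array([1, 2, 3])]            # values on disk
units = ['days since 1900-01-01', 'days since 1900-01-04']  # per-file encoding

# decode each file's values with its own units ...
dates = np.concatenate(
    [cftime.num2date(r, units=u, calendar='noleap') for r, u in zip(raw, units)]
)
# ... then re-encode against the first file's units, so one encoding fits all chunks
reencoded = cftime.date2num(dates, units='days since 1900-01-01', calendar='noleap')
print(reencoded)  # [1 2 3 4 5 6]
```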
Top GitHub Comments

- I think so, yes. It certainly appeared to give the right results in the tutorial notebook.
- I think you probably mean https://github.com/pangeo-forge/pangeo-forge-recipes/pull/174?