Variable encoding lost in MultiZarrToZarr

See the original GitHub issue (fsspec-reference-maker issue #69)

I have been playing around with fsspec-reference-maker in developing a tutorial example for https://github.com/pangeo-forge/pangeo-forge-recipes/pull/174.

I have discovered a problem related to how encoding is handled. When the same variable is encoded differently in different files, the MultiZarrToZarr reference filesystem assumes a single encoding for all of them. Because the raw data on disk are not changed, the incorrect encoding gets applied to some of the chunks.

This may seem like a niche case, but it is actually very common for time-series netCDF files to encode their time coordinate as an integer with units 'days since X', where X differs from file to file.
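
As a quick illustration, the same raw integer decodes to a different date under each file's units attribute. This is a minimal check, assuming the cftime package (which xarray uses for the noleap calendar) is installed:

import cftime

# The raw value 1 means a different date depending on the per-file reference date.
print(cftime.num2date(1, units="days since 1900-01-01", calendar="noleap"))  # 1900-01-02
print(cftime.num2date(1, units="days since 1900-01-04", calendar="noleap"))  # 1900-01-05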

The following example is a simple standalone reproducer. The key point is that the two original datasets have the same raw data in the time coord, but different attributes.

import os
import json

import xarray as xr
import fsspec
from fsspec_reference_maker.combine import MultiZarrToZarr
from fsspec_reference_maker.hdf import SingleHdf5ToZarr

ds1 = xr.DataArray(
    [0, 0, 0],
    name='foo',
    dims=['time'],
    coords={
        'time': (
            ['time'],
            [1, 2, 3], 
            {'units': 'days since 1900-01-01', 'calendar': 'noleap'}
        )
    }
).to_dataset()
ds1.to_netcdf('ds1.nc', mode='w')

ds2 = xr.DataArray(
    [1, 1, 1],
    name='foo',
    dims=['time'],
    coords={
        'time': (
            ['time'],
            [1, 2, 3], 
            {'units': 'days since 1900-01-04', 'calendar': 'noleap'}
        )
    }
).to_dataset()
ds2.to_netcdf('ds2.nc', mode='w')

fnames = ['ds1.nc', 'ds2.nc']
ds_concat = xr.open_mfdataset(fnames).load()


for fname in fnames:
    json_fname = os.path.basename(fname + ".json")
    url = 'file://' + os.path.abspath(fname)
    with fsspec.open(fname) as f:
        h5chunks = SingleHdf5ToZarr(f, url, inline_threshold=300)
        chunks = h5chunks.translate()
    with open(json_fname, mode='wt') as f_out:
        json.dump(chunks, f_out)

dsets = []
for fname in fnames:
    ref_url = 'file://' + os.path.abspath(fname + ".json")
    m = fsspec.get_mapper(
        "reference://",
        fo=ref_url,
        target_protocol="file",
        remote_protocol="file",  # the reference URLs point to local files, not S3
        skip_instance_cache=True,
    )
    dsets.append(xr.open_dataset(m, engine='zarr', backend_kwargs={'consolidated': False}))

ds_concat_ref = xr.concat(dsets, dim='time')

json_files = ['file://' + os.path.abspath(fname + ".json") for fname in fnames]
mzz = MultiZarrToZarr(
    json_files,
    xarray_concat_args={'dim': 'time'},
    remote_protocol='file'
)
out = mzz.translate(None)
with open('combined.json', 'wt') as fp:
    json.dump(out, fp)

m_combined = fsspec.get_mapper(
    "reference://",
    fo='combined.json',
    target_protocol="file",
    remote_protocol="file",
    skip_instance_cache=True,
)
ds_combined = xr.open_dataset(m_combined, engine='zarr', backend_kwargs={'consolidated': False})

print(ds_concat.time)
print(ds_concat_ref.time)
print(ds_combined.time)

xr.testing.assert_equal(ds_concat.time, ds_concat_ref.time)
xr.testing.assert_equal(ds_concat.time, ds_combined.time)

The first assertion passes (each single-file reference dataset decodes correctly), but the second one fails:

AssertionError: Left and right DataArray objects are not equal

Differing values:
L
    array([cftime.DatetimeNoLeap(1900, 1, 2, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 6, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 7, 0, 0, 0, 0, has_year_zero=True)],
          dtype=object)
R
    array([cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 3, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 4, 0, 0, 0, 0, has_year_zero=True),
           cftime.DatetimeNoLeap(1900, 1, 5, 0, 0, 0, 0, has_year_zero=True)],
          dtype=object)
Differing coordinates:
L * time     (time) object 1900-01-02 00:00:00 ... 1900-01-07 00:00:00
R * time     (time) object 1900-01-03 00:00:00 ... 1900-01-05 00:00:00

Digging deeper, we can see that each individual SingleHdf5ToZarr reference dataset preserves the correct encoding. But the combined dataset has a new encoding (determined by xarray) which doesn’t match the actual encoding values on disk.

print(dsets[0].time.encoding)
print(dsets[1].time.encoding)
print(ds_combined.time.encoding)
{'chunks': (3,), 'preferred_chunks': {'time': 3}, 'compressor': None, 'filters': None, '_FillValue': -9223372036854775806, 'units': 'days since 1900-01-01', 'calendar': 'noleap', 'dtype': dtype('int64')}
{'chunks': (3,), 'preferred_chunks': {'time': 3}, 'compressor': None, 'filters': None, '_FillValue': -9223372036854775806, 'units': 'days since 1900-01-04', 'calendar': 'noleap', 'dtype': dtype('int64')}
{'chunks': (6,), 'preferred_chunks': {'time': 6}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, '_FillValue': 0, 'units': 'days since 1900-01-02 00:00:00.000000', 'calendar': 'noleap', 'dtype': dtype('int64')}
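
The incorrect values above can be reproduced by hand: the combined reference keeps the unchanged raw integers ([1, 2, 3] from each file) but decodes all of them with the single units string of the combined dataset. A quick check, again assuming cftime is available:

import cftime

# Decoding the concatenated raw values with the combined units reproduces the
# wrong result from the failed assertion (1900-01-03 ... 1900-01-05, twice).
print(cftime.num2date([1, 2, 3, 1, 2, 3],
                      units="days since 1900-01-02", calendar="noleap"))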

This is a pretty tricky situation. The whole point of the reference filesystem is to avoid copying the data. But there is no way for an xarray dataset to use different encoding for different parts of the same array. So the best we could do in this case would be to raise an error when encoding is inconsistent. Alternatively, for small data (< inline_threshold), we could consider reencoding data with uniform encoding, like xarray does when it writes data to disk.
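
A hedged sketch of the first option, reusing the per-file reference datasets (dsets) opened above: compare the encoding-relevant attributes before combining and fail loudly if they differ. The helper name and the list of attributes checked are illustrative assumptions, not part of the library:

ENCODING_KEYS = ["units", "calendar", "dtype", "_FillValue", "scale_factor", "add_offset"]

def check_consistent_encoding(datasets, varname):
    # Raise if any encoding-relevant attribute differs across datasets for `varname`.
    reference = {k: datasets[0][varname].encoding.get(k) for k in ENCODING_KEYS}
    for i, ds in enumerate(datasets[1:], start=1):
        enc = {k: ds[varname].encoding.get(k) for k in ENCODING_KEYS}
        if enc != reference:
            raise ValueError(
                f"Inconsistent encoding for {varname!r}: "
                f"dataset 0 has {reference}, dataset {i} has {enc}"
            )

check_consistent_encoding(dsets, "time")  # raises ValueError because 'units' differs between the two files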

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Apr 20, 2022

I think so, yes. It certainly appeared to give the right results in the tutorial notebook.

1 reaction
cisaacstern commented, Aug 25, 2021

"a tutorial example for pangeo-forge/pangeo-forge-recipes#167"

I think you probably mean https://github.com/pangeo-forge/pangeo-forge-recipes/pull/174?
