
`to_zarr` with append or region mode and `_FillValue` doesn't work

See original GitHub issue

What happened?

import numpy as np
import xarray as xr
ds = xr.Dataset({"a": ("x", [3.], {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m)
ds.to_zarr(m, append_dim="x")

raises

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

What did you expect to happen?

I’d expect this to just work (effectively concatenating the dataset to itself).
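For reference, the expected outcome of the append would match concatenating the dataset to itself with `xr.concat` (a sketch of the intended result only, not of the failing write):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", [3.], {"_FillValue": np.nan})})

# If the append worked, the store contents should be equivalent to:
expected = xr.concat([ds, ds], dim="x")
print(expected.a.values)  # [3. 3.]
```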

Anything else we need to know?

appears also for region writes

The same issue appears for region writes as in:

import numpy as np
import dask.array as da
import xarray as xr
ds = xr.Dataset({"a": ("x", da.array([3.,4.]), {"_FillValue": np.nan})})
m = {}
ds.to_zarr(m, compute=False, encoding={"a": {"chunks": (1,)}})
ds.isel(x=slice(0,1)).to_zarr(m, region={"x": slice(0,1)})

raises

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

there’s a workaround

The workaround (deleting the _FillValue in subsequent writes):

m = {}
ds.to_zarr(m)
del ds.a.attrs["_FillValue"]
ds.to_zarr(m, append_dim="x")

seems to do the trick.

There are indications that the result might still be broken, but it’s not yet clear how to reproduce them (see comments below).

This issue has been split off from #6069

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.10 (main, Jan 15 2022, 11:48:00) [Clang 13.0.0 (clang-1300.0.29.3)]
python-bits: 64
OS: Darwin
OS-release: 20.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: ('de_DE', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.20.1
pandas: 1.2.0
numpy: 1.21.2
scipy: 1.6.2
netCDF4: 1.5.8
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.11.0
cftime: 1.3.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.1
distributed: 2021.11.1
matplotlib: 3.4.1
cartopy: 0.20.1
seaborn: 0.11.1
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.17
sparse: 0.13.0
setuptools: 60.5.0
pip: 21.3.1
conda: None
pytest: 6.2.2
IPython: 8.0.0.dev
sphinx: 3.5.0

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 17 (7 by maintainers)

Top GitHub Comments

1 reaction
d70-t commented, Mar 11, 2022

Thanks for pointing out region again. I’ve updated the header and the initial comment.

1 reaction
d70-t commented, Mar 10, 2022

Yes, this is kind of the behaviour I’d expect. And great that it helped clarify things. Still, building up the metadata nicely upfront (which is required for region writes) is quite convoluted… That’s what I meant with

some better tooling for writing and updating zarr dataset metadata (I don’t know if that would fit in the realm of xarray though, as it looks like handling Datasets without content. For “appending” metadata, I really don’t know how I’d picture this properly in xarray world.)

in the previous comment. I think establishing and documenting good practices for this would help, but we probably also want better tools. In any case, that would probably be yet another issue.

Note that if you care about this particular example (e.g. appending in a single thread in increasing order of timesteps), it should also be possible to do this much more simply using append:

import os
import numpy as np
import xarray as xr

filename = 'processed_dataset.zarr'
ds = xr.tutorial.open_dataset('air_temperature')
ds.air.encoding['dtype'] = np.dtype('float32')
X, Y = 250, 250  # size of each final timestep

for i in range(len(ds.time)):
    # some kind of heavy processing (some_processing is a user-defined placeholder)
    arr_r = some_processing(ds.isel(time=slice(i, i + 1)), X, Y)
    del arr_r.air.attrs["_FillValue"]
    if os.path.exists(filename):
        arr_r.to_zarr(filename, append_dim='time')
    else:
        arr_r.to_zarr(filename)

If you find out more about the cloud case, please post a note; otherwise, can we assume that the original bug report is fine?
