Writing a dataset to .zarr in a loop makes all the data NaNs
What happened?
I have 61 pickled pandas dataframes that I'm trying to convert from pickle/pandas to zarr/xarray. Since the dataframes are large (10000x2048), I can't load them all into memory. To get around this (MCVE below), I'm looping through the pickle files, reading each one into a dataframe, constructing a DataArray and then a Dataset from the data, concatenating that dataset with the previous one, and updating the dataset variable to point to the new concatenated dataset.
Since I didn’t want to use up too much memory, I’m also periodically writing the Dataset to .zarr in the loop and reopening it (hoping to make use of dask storing data on disk?).
When I do this however, the final dataset ends up being all NaNs.
What did you expect to happen?
I expected the final dataset to contain all the concatenated data.
Minimal Complete Verifiable Example
import pandas as pd
import numpy as np
import glob
import xarray as xr
from tqdm import tqdm
# Creating pkl files
[pd.DataFrame(np.random.randint(0,10, (1000,500))).astype(object).to_pickle('df{}.pkl'.format(i)) for i in range(4)]
fnames = glob.glob('*.pkl')
df = pd.read_pickle(fnames[0])
df.columns = np.arange(0,500).astype(object) # the real pkl files contain all objects
df.index = np.arange(0,1000).astype(object)
df = df.astype(np.float32)
ds = xr.DataArray(df.values, dims=['fname', 'res_dim'],
                  coords={'fname': df.index.values, 'res_dim': df.columns.values})
ds = ds.to_dataset(name='low_dim')
for idx, fname in enumerate(tqdm(fnames[1:])):
    df = pd.read_pickle(fname)
    df.columns = np.arange(0, 500).astype(object)
    df.index = np.arange(0, 1000).astype(object)
    df = df.astype(np.float32)
    ds2 = xr.DataArray(df.values, dims=['fname', 'res_dim'],
                       coords={'fname': df.index.values, 'res_dim': df.columns.values})
    ds2 = ds2.to_dataset(name='low_dim')
    ds = xr.concat([ds, ds2], dim='fname')
    ds['fname'] = ds.fname.astype(str)
    if (idx % 2 == 0) & (idx != 0):
        ds.to_zarr('zarr_bug.zarr', mode='w')
        ds = xr.open_zarr('zarr_bug.zarr')
ds.to_zarr('zarr_bug.zarr', mode='w')
ds = xr.open_zarr('zarr_bug.zarr')
print(ds.low_dim.values)
Relevant log output
[[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
...
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]
[nan nan nan ... nan nan nan]]
Anything else we need to know?
If I get rid of the saving inside the loop, everything works normally.
Environment
INSTALLED VERSIONS
commit: None
python: 3.9.11 (main, Mar 28 2022, 10:10:35) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.11.0-27-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 2022.3.0
pandas: 1.4.1
numpy: 1.21.0
scipy: 1.8.0
netCDF4: 1.5.8
pydap: installed
h5netcdf: 1.0.0
h5py: 3.6.0
Nio: None
zarr: 2.11.1
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: 0.9.10.1
iris: None
bottleneck: None
dask: 2022.03.0
distributed: 2022.3.0
matplotlib: 3.5.1
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: 2022.02.0
cupy: None
pint: None
sparse: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: None
IPython: 8.1.1
sphinx: None
@tbloch1 following up on Max's suggestion - it looks like you might be overwriting the file with every iteration. See the docs on ds.to_zarr: mode='w' will overwrite the file while mode='a' will append. That said, you would still need your indices to not overlap. How are you distinguishing between the files? Is each one a different point in time?

To me, this doesn't seem likely to be a bug, but more of a usage question. Have you tried asking on stackoverflow with the xarray tag?
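A minimal sketch of that append pattern (assuming the 'fname' dimension from the MCVE; the make() helper and the store path are hypothetical stand-ins for reading one pickle file):

import numpy as np
import xarray as xr

def make(i):
    # hypothetical stand-in for one per-file dataset from the MCVE
    return xr.Dataset(
        {'low_dim': (('fname', 'res_dim'), np.random.rand(2, 3).astype(np.float32))},
        coords={'fname': np.array(['a%d' % i, 'b%d' % i]), 'res_dim': np.arange(3)},
    )

make(0).to_zarr('append_demo.zarr', mode='w')  # first write creates the store
for i in range(1, 4):
    # append along the concat dimension instead of overwriting with mode='w'
    make(i).to_zarr('append_demo.zarr', append_dim='fname')

print(xr.open_zarr('append_demo.zarr').low_dim.shape)  # (8, 3)

Note the 'fname' labels here are distinct across writes, matching the comment's point that the indices must not overlap.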
Right, you changed the example after I responded.
Something surprising is indeed going on here. To focus on the surprising part:
returns:
Similarly:
So it’s changing the result in memory just from writing to the Zarr store. I’m not sure what the cause is.
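One guess at a mechanism, sketched below: open_zarr returns a lazily loaded dataset, so overwriting the store it still points at changes what .values later returns. This is an assumption about the cause rather than a confirmed diagnosis, and the 'store.zarr' path is made up:

import numpy as np
import xarray as xr

xr.Dataset({'x': ('a', np.arange(4.0))}).to_zarr('store.zarr', mode='w')
lazy = xr.open_zarr('store.zarr')  # values are not read from disk yet
# overwrite the very store the lazy dataset still points at
xr.Dataset({'x': ('a', np.zeros(4))}).to_zarr('store.zarr', mode='w')
print(lazy.x.values)  # typically reads the overwritten store, not the original data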
We can still massively reduce the size of this example: it's currently doing pickling, has a bunch of repeated code, etc. Does it work without the pickling? What if ds3 = xr.concat([ds1, ds1.copy(deep=True)], dim='fname'), etc.?
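One possible reduction along those lines (hypothetical, not from the thread; ds1, ds3, and the store path are assumptions), keeping only the concat with object-dtype coordinates and the write/reopen/write cycle, so the open question becomes whether this alone reproduces the NaNs:

import numpy as np
import xarray as xr

ds1 = xr.Dataset(
    {'low_dim': (('fname', 'res_dim'), np.random.rand(4, 3).astype(np.float32))},
    coords={'fname': np.arange(4).astype(object), 'res_dim': np.arange(3).astype(object)},
)
ds3 = xr.concat([ds1, ds1.copy(deep=True)], dim='fname')  # duplicate 'fname' labels, as in the MCVE
ds3['fname'] = ds3.fname.astype(str)

ds3.to_zarr('reduced_bug.zarr', mode='w')
ds3 = xr.open_zarr('reduced_bug.zarr')     # now lazily backed by the store
ds3.to_zarr('reduced_bug.zarr', mode='w')  # write the lazy dataset back onto its own store
print(xr.open_zarr('reduced_bug.zarr').low_dim.values)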