
Writing a dataset to .zarr in a loop makes all the data NaNs

See original GitHub issue

What happened?

I have lots (61) of pickled pandas DataFrames that I’m trying to convert from pickle/pandas to zarr/xarray. Since the DataFrames are large (10000x2048), I can’t load them all into memory. To get around this (MCVE below), I’m looping through the pickle files, reading each into a DataFrame, constructing a DataArray and then a Dataset from it, concatenating that with the previous Dataset, and rebinding my variable to the concatenated result.

Since I didn’t want to use up too much memory, I’m also periodically writing the Dataset to .zarr inside the loop and reopening it (hoping to make use of dask keeping the data on disk).

When I do this, however, the final dataset ends up being all NaNs.

What did you expect to happen?

I expected the final dataset to contain all the concatenated data.

Minimal Complete Verifiable Example

import pandas as pd
import numpy as np
import glob
import xarray as xr
from tqdm import tqdm

# Create four sample pickle files (the real files contain object dtypes)
for i in range(4):
    pd.DataFrame(np.random.randint(0, 10, (1000, 500))).astype(object).to_pickle('df{}.pkl'.format(i))

fnames = glob.glob('*.pkl')

df = pd.read_pickle(fnames[0])
df.columns = np.arange(0,500).astype(object) # the real pkl files contain all objects
df.index = np.arange(0,1000).astype(object)
df = df.astype(np.float32)

# Build the initial Dataset from the first file
ds = xr.DataArray(df.values, dims=['fname', 'res_dim'],
                  coords={'fname': df.index.values, 'res_dim': df.columns.values})
ds = ds.to_dataset(name='low_dim')

for idx, fname in enumerate(tqdm(fnames[1:])):
    df = pd.read_pickle(fname)
    df.columns = np.arange(0, 500).astype(object)
    df.index = np.arange(0, 1000).astype(object)
    df = df.astype(np.float32)

    ds2 = xr.DataArray(df.values, dims=['fname', 'res_dim'],
                       coords={'fname': df.index.values, 'res_dim': df.columns.values})
    ds2 = ds2.to_dataset(name='low_dim')

    ds = xr.concat([ds, ds2], dim='fname')
    ds['fname'] = ds.fname.astype(str)
    if idx % 2 == 0 and idx != 0:
        # Periodically flush to disk and reopen lazily, hoping to free memory
        ds.to_zarr('zarr_bug.zarr', mode='w')
        ds = xr.open_zarr('zarr_bug.zarr')

# Final write and lazy reopen
ds.to_zarr('zarr_bug.zarr', mode='w')
ds = xr.open_zarr('zarr_bug.zarr')

print(ds.low_dim.values)

Relevant log output

[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]

Anything else we need to know?

If I get rid of the loop saving, everything works normally.
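
One workaround that keeps the in-loop saving is to load the lazily opened data back into memory before each overwrite, so that no dask chunk still references the store being replaced. Below is a minimal self-contained sketch of that pattern; the safe.zarr path and the random data are illustrative, not from the original report:

import numpy as np
import xarray as xr

ds = xr.Dataset({'low_dim': (('fname', 'res_dim'),
                             np.random.rand(1000, 500).astype(np.float32))})

for idx in range(3):
    ds2 = xr.Dataset({'low_dim': (('fname', 'res_dim'),
                                  np.random.rand(1000, 500).astype(np.float32))})
    ds = xr.concat([ds, ds2], dim='fname')
    # Materialise before overwriting: after open_zarr, `ds` reads lazily from
    # 'safe.zarr', and mode='w' would delete the chunks it still points at.
    ds = ds.load()
    ds.to_zarr('safe.zarr', mode='w')
    ds = xr.open_zarr('safe.zarr')

print(int(ds.low_dim.count()))  # 2000000: nothing has been replaced by NaN

Loading defeats part of the memory saving, though; the append pattern discussed in the comments below avoids ever holding everything in memory at once.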

Environment

INSTALLED VERSIONS

commit: None
python: 3.9.11 (main, Mar 28 2022, 10:10:35) [GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.11.0-27-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 2022.3.0
pandas: 1.4.1
numpy: 1.21.0
scipy: 1.8.0
netCDF4: 1.5.8
pydap: installed
h5netcdf: 1.0.0
h5py: 3.6.0
Nio: None
zarr: 2.11.1
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: 0.9.10.1
iris: None
bottleneck: None
dask: 2022.03.0
distributed: 2022.3.0
matplotlib: 3.5.1
cartopy: None
seaborn: 0.11.2
numbagg: None
fsspec: 2022.02.0
cupy: None
pint: None
sparse: None
setuptools: 58.0.4
pip: 21.2.4
conda: None
pytest: None
IPython: 8.1.1
sphinx: None

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 10 (2 by maintainers)

Top GitHub Comments

1 reaction
delgadom commented, Apr 10, 2022

@tbloch1 following up on Max’s suggestion - it looks like you might be overwriting the file with every iteration. See the docs on ds.to_zarr - mode='w' will overwrite the file, while mode='a' will append. That said, you would still need your indices to not overlap. How are you distinguishing between the files? Is each one a different point in time?

To me, this doesn’t seem likely to be a bug, but is more of a usage question. Have you tried asking on stackoverflow with the xarray tag?
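
To make the mode='a' suggestion concrete, here is a minimal sketch of the append pattern, assuming the rows really are distinct so the fname labels can be made non-overlapping; the make_chunk helper and the appended.zarr path are hypothetical stand-ins for reading one pickle file:

import numpy as np
import xarray as xr

def make_chunk(start):
    # Hypothetical stand-in for one converted pickle file: 1000 float32 rows
    # with non-overlapping integer 'fname' labels.
    data = np.random.rand(1000, 500).astype(np.float32)
    return xr.Dataset(
        {'low_dim': (('fname', 'res_dim'), data)},
        coords={'fname': np.arange(start, start + 1000),
                'res_dim': np.arange(500)},
    )

make_chunk(0).to_zarr('appended.zarr', mode='w')  # first write creates the store
for start in (1000, 2000, 3000):
    # mode='a' plus append_dim extends the store on disk; nothing is
    # overwritten, and nothing is held in memory between iterations.
    make_chunk(start).to_zarr('appended.zarr', mode='a', append_dim='fname')

print(xr.open_zarr('appended.zarr').sizes)  # fname should now be 4000

Because each chunk is written and then forgotten, memory use stays flat no matter how many files are converted.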

0 reactions
max-sixty commented, Apr 14, 2022

> @max-sixty could you explain which bit isn’t working for you? The initial example I shared works fine in colab for me, so that might be a you problem. The second one required specifying the chunks when making the datasets (I’ve edited above).

Right, you changed the example after I responded.

> But this bug report was more about the fact that overwriting was converting data to NaNs (in two different ways depending on the code, apparently).
>
> In my case there is no longer any need to do the overwriting, but this doesn’t seem like the expected behaviour of overwriting, and I’m sure there are some valid reasons to overwrite data - hence me opening the bug report.

Something surprising is indeed going on here. To focus on the surprising part:

print(ds3.low_dim.values)

ds3.to_zarr('zarr_bug.zarr', mode='w')

print(ds3.low_dim.values)

returns:

[[2. 3. 2. ... 8. 0. 9.]
 [6. 2. 6. ... 2. 4. 3.]
 [0. 8. 8. ... 6. 5. 4.]
 ...
 [1. 0. 5. ... 2. 0. 3.]
 [5. 5. 7. ... 9. 6. 2.]
 [5. 7. 8. ... 4. 8. 9.]]
[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [ 1.  0.  5. ...  2.  0.  3.]
 [ 5.  5.  7. ...  9.  6.  2.]
 [ 5.  7.  8. ...  4.  8.  9.]]

Similarly:

In [50]: ds3.low_dim.count().compute()
Out[50]:
<xarray.DataArray 'low_dim' ()>
array(1000000)

In [51]: ds3.to_zarr('zarr_bug.zarr', mode='w')
Out[51]: <xarray.backends.zarr.ZarrStore at 0x16a27c6d0>

In [55]: ds3.low_dim.count().compute()
Out[55]:
<xarray.DataArray 'low_dim' ()>
array(500000)

So it’s changing the result in memory just from writing to the Zarr store. I’m not sure what the cause is.

We can still massively reduce the size of this example: it currently does pickling, has a bunch of repeated code, etc. Does it work without the pickling? What if ds3 = xr.concat([ds1, ds1.copy(deep=True)], dim='fname'), etc.?
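
A plausible reduction along those lines: the dataset returned by xr.open_zarr is lazy, so its values are re-read from the store at compute time, and overwriting that same store pulls the data out from under any dataset still referencing it. A sketch, with the exact outcome of the second count left unasserted since that is the behaviour under investigation:

import numpy as np
import xarray as xr

ds1 = xr.Dataset({'low_dim': (('fname', 'res_dim'),
                              np.random.rand(1000, 500).astype(np.float32))})
ds1.to_zarr('zarr_bug.zarr', mode='w')
ds1 = xr.open_zarr('zarr_bug.zarr')   # lazy handle into the store

ds3 = xr.concat([ds1, ds1.copy(deep=True)], dim='fname')
print(int(ds3.low_dim.count()))       # 1000000 before the write

# Overwrite the store that ds3's lazy chunks still point at:
ds3.to_zarr('zarr_bug.zarr', mode='w')
print(int(ds3.low_dim.count()))       # if the bug reproduces, this drops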
