jupyter repr caching deleted netcdf file
See original GitHub issueWhat happened:
Testing xarray data storage in a jupyter notebook with varying data sizes and storing to a netcdf, i noticed that open_dataset/array (both show this behaviour) continue to return data from the first testing run, ignoring the fact that each run deletes the previously created netcdf file.
This only happens once the repr
was used to display the xarray object.
But once in error mode, even the previously fine printed objects are then showing the wrong data.
This was hard to track down as it depends on the precise sequence in jupyter.
What you expected to happen:
when i use open_dataset/array
, the resulting object should reflect reality on disk.
Minimal Complete Verifiable Example:
import xarray as xr
from pathlib import Path
import numpy as np
def test_repr(nx):
ds = xr.DataArray(np.random.rand(nx))
path = Path("saved_on_disk.nc")
if path.exists():
path.unlink()
ds.to_netcdf(path)
return path
When executed in a cell with print for display, all is fine:
test_repr(4)
print(xr.open_dataset("saved_on_disk.nc"))
test_repr(5)
print(xr.open_dataset("saved_on_disk.nc"))
but as soon as one cell used the jupyter repr:
xr.open_dataset("saved_on_disk.nc")
all future file reads, even after executing the test function again and even using print
and not repr
, show the data from the last repr use.
Anything else we need to know?:
Here’s a notebook showing the issue: https://gist.github.com/05c2542ed33662cdcb6024815cc0c72c
Environment:
Output of <tt>xr.show_versions()</tt>
INSTALLED VERSIONS
commit: None python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0] python-bits: 64 OS: Linux OS-release: 5.4.0-40-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4
xarray: 0.16.0 pandas: 1.0.5 numpy: 1.19.0 scipy: 1.5.1 netCDF4: 1.5.3 pydap: None h5netcdf: None h5py: 2.10.0 Nio: None zarr: None cftime: 1.2.1 nc_time_axis: None PseudoNetCDF: None rasterio: 1.1.5 cfgrib: None iris: None bottleneck: None dask: 2.21.0 distributed: 2.21.0 matplotlib: 3.3.0 cartopy: 0.18.0 seaborn: 0.10.1 numbagg: None pint: None setuptools: 49.2.0.post20200712 pip: 20.1.1 conda: installed pytest: 6.0.0rc1 IPython: 7.16.1 sphinx: 3.1.2
Issue Analytics
- State:
- Created 3 years ago
- Comments:9 (4 by maintainers)
Running
xarray.backends.file_manager.FILE_CACHE.clear()
fixed the issue for me. I couldn’t find any other way to stop xarray from pulling up some old data from a newly saved file. I’m using the h5netcdf engine with xarray version 2022.6.0 by the way.Probably the easiest work around is to call
.close()
on the original dataset. Failing that, the file is cached inxarray.backends.file_manager.FILE_CACHE
, which you could muck around with.I believe it only gets activated by
repr()
because array values from netCDF file are loaded lazily. Not 100% without more testing, though.