Loading from NetCDF creates unnecessary numpy.ndarray views that clear the OWNDATA flag
What happened?
When loading a NetCDF4 dataset from a file (at least using the ‘h5netcdf’ engine) I get an xarray.Dataset where each DataArray wraps a numpy.ndarray with the OWNDATA flag set to False. This is counter-intuitive, as the high-level user has not knowingly run anything that would create a view/slice and thereby get a second array sharing memory with the “first” array.
This is of course a rather minor issue, but it annoyed me when I was making tools to keep track of which arrays in my dataset were using much RAM, because I had added an option to only show memory usage for the primary/base array, not for any extra array views that reuse the same memory. With this option enabled, however, the reported issue prevents me from getting useful information (it just shows nearly zero memory usage) when inspecting a Dataset loaded from a NetCDF4 file instead of a “freshly” created or deep-copied Dataset.
What did you expect to happen?
I would prefer the OWNDATA flag to stay True, as it is on the arrays coming from the lower-level h5py reading of the HDF5 file. After some debugging, this turns out to mean avoiding things like array = array[:, …] at various places in the multiple layers of wrappers involved in dataset loading, since that creates and returns a new ndarray instance “sharing” (not owning) memory with the original (while no user-accessible reference to the original ndarray that technically “owns” the memory seems to be retained).
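For reference, a minimal standalone illustration (plain NumPy, independent of xarray) of what such a full-slice indexing step does:

import numpy as np
a = np.random.randn(2)
b = a[:, ...]                 # full slice: returns a view, not a copy
print(b.flags['OWNDATA'])     # False: b does not own its memory
print(b.base is a)            # True: b shares a's buffer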
See the minimal code example below. It and my patch were made using xarray 0.20.1 with h5netcdf 0.12.0, but it looks like the relevant parts of xarray/core/indexing.py are still the same.
Minimal Complete Verifiable Example
import xarray as xr
import numpy as np
ds = xr.Dataset()
ds['stuff'] = xr.DataArray(np.random.randn(2), dims='x')
path = 'demo.nc'
ds.to_netcdf(path, engine='h5netcdf', format='netCDF4', invalid_netcdf=True)
ds2 = xr.load_dataset(path, engine='h5netcdf')
print(ds2['stuff'].values.flags['OWNDATA']) # initially False, True after patching
Relevant log output
No response
Anything else we need to know?
I patched two parts of xarray/core/indexing.py to solve the issue: xarray core indexing.diff.txt
Testing in other situations will of course be needed to make sure this doesn’t disturb anything else, but I hope the general idea would be useful even if some further condition might be needed for when to take the shortcut of returning the original numpy.ndarray rather than a view of it.
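The attached diff is the actual change; as a rough sketch of the general idea only (hypothetical helper, not the real patch), the indexing code would check whether the requested key selects the whole array and, if so, return the original ndarray instead of a view:

import numpy as np

def apply_basic_key(array, key):
    # key is assumed to be a tuple of basic indexers (slices / Ellipsis).
    if all(k is Ellipsis or k == slice(None) for k in key):
        return array       # shortcut: keep the original array, OWNDATA stays True
    return array[key]      # otherwise create the usual view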
Environment
xarray: 0.20.1
pandas: 1.3.4
numpy: 1.21.5
scipy: 1.7.3
netCDF4: None
pydap: None
h5netcdf: 0.12.0
h5py: 2.10.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2021.10.0
distributed: 2021.10.0
matplotlib: 3.3.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 0.8.3
cupy: None
pint: 0.16.1
sparse: None
setuptools: 52.0.0.post20210125
pip: 20.3.3
conda: 4.12.0
pytest: 6.2.2
IPython: 7.20.0
sphinx: 3.4.3
FYI: since h5netcdf recently moved to version 1.0, I’ve checked with the latest xarray (2022.3.0) and the latest h5netcdf (1.0.0). The OP example still reproduces the issue, and the OP fix (updated for the current code) still resolves it.
You may be right that the OWNDATA-flag is more of an internal numpy thing for its memory management, and that there is no general requirement or guarantee that higher-level libraries should avoid creating “unnecessary” layers of views.
I had just gotten used to nice behaviour from the other xarray operations I was using (isel() and []-slicing created views as expected, while e.g. sel() and mean(), which create array copies, did not create any unnecessary view on top of those).
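A small illustration of that expected behaviour (my understanding, shown here with isel() using a slice versus a list indexer):

import numpy as np
import xarray as xr

da = xr.DataArray(np.random.randn(4), dims='x')

view = da.isel(x=slice(0, 2))                    # basic indexing: a view
print(view.values.flags['OWNDATA'])              # False
print(np.shares_memory(view.values, da.values))  # True: memory is shared

sub = da.isel(x=[0, 2])                          # list ("fancy") indexing: a copy
print(sub.values.flags['OWNDATA'])               # True
print(np.shares_memory(sub.values, da.values))   # False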
While not creating extra view objects for viewing the entire array could also be seen as an optimization, the net benefit is not obvious, since the extra checks in the if-cases of my patch add some work too. (And there is of course a risk that a change deep down in the indexing methods has unintended consequences.)
I would thus be OK with closing this issue as “won’t fix”, which I suppose you were heading towards unless demand from others appears.
I followed your suggestion and changed my memory_size() function so that it does not just look at whether OWNDATA is True/False (or, probably equivalently, whether ndarray.base is None), but recursively follows ndarray.base.base… and tracks the id() of objects to avoid counting the same memory more than once. The new version behaves differently: when called on a single DataArray whose data was defined by slicing something else, it counts the size of the full base array instead of 0 (or about 100 bytes of overhead) as before, but within a Dataset (or optionally a set of multiple Datasets) any other reference to the same base array won’t be counted again. I can live with this new, more “relative” than “absolute”, definition of where memory is considered “shared”.
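A minimal sketch of that approach (not my actual implementation; here memory_size() just takes a flat iterable of numpy arrays):

import numpy as np

def memory_size(arrays):
    # Follow each array's .base chain to the ultimate owner of its buffer
    # and count every owner's nbytes only once, keyed by id().
    seen = set()
    total = 0
    for arr in arrays:
        base = arr
        while isinstance(base.base, np.ndarray):
            base = base.base
        if id(base) not in seen:
            seen.add(id(base))
            total += base.nbytes
    return total

a = np.zeros(1000)
print(memory_size([a, a[10:]]))   # 8000 bytes, not 16000: the view is not counted twice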