open_mfdataset memory leak, very simple case. v0.12
MCVE Code Sample
import glob
import os

import numpy as np
import xarray as xr
from memory_profiler import profile


def CreateTestFiles():
    # create a bunch of single-time-step files
    xlen = int(1e2)
    ylen = int(1e2)
    xdim = np.arange(xlen)
    ydim = np.arange(ylen)
    nfiles = 100
    os.makedirs('testfiles', exist_ok=True)  # make sure the output directory exists
    for i in range(nfiles):
        data = np.random.rand(xlen, ylen, 1)
        datafile = xr.DataArray(data, coords=[xdim, ydim, [i]],
                                dims=['x', 'y', 'time'])
        datafile.to_netcdf('testfiles/datafile_{}.nc'.format(i))


@profile
def ReadFiles():
    xr.open_mfdataset(glob.glob('testfiles/*'), concat_dim='time')


if __name__ == '__main__':
    # write out files for testing
    CreateTestFiles()
    # loop thru the file-read step
    for i in range(100):
        ReadFiles()
usage:

    mprof run simplest_case.py
    mprof plot

(mprof is the command-line tool that ships with the memory_profiler Python package)
Problem Description
dask 1.1.4, xarray 0.12, Python 3.7.3
There appears to be a persistent memory leak in open_mfdataset. I'm creating a model calibration script that runs for ~1000 iterations, opening and closing the same set of files (the dimensions are the same, but the data differ) on each iteration, and I eventually run out of memory because of the leak. The simple case above reproduces the same behavior, and closing the files with .close() does not fix the problem.
Is there a workaround for this? I've perused some of the existing issues but cannot tell whether this has been resolved.
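For reference, the explicit-close pattern mentioned above looks roughly like the sketch below (using the dataset as a context manager is equivalent to calling .close()); per the report it does not stop the memory growth. Paths follow the reproduction script.

```python
# Sketch of the close-after-use pattern described above; closing the dataset
# (here via a context manager) reportedly does not prevent the memory growth.
import glob
import xarray as xr

def read_once():
    with xr.open_mfdataset(glob.glob('testfiles/*'), concat_dim='time') as ds:
        pass  # the dataset is closed on exiting the block

for _ in range(100):
    read_once()
```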
Output of xr.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.17.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.0
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: 1.5.5
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.1.4
distributed: 1.26.0
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: None
setuptools: 41.0.1
pip: 19.1.1
conda: None
pytest: None
IPython: 7.3.0
sphinx: None
Top GitHub Comments
Thanks for the profiling script. I ran a few permutations of this:

- xarray.open_mfdataset with engine='netcdf4' (the default)
- xarray.open_mfdataset with engine='h5netcdf'
- xarray.open_dataset with engine='netcdf4' (the default)
- xarray.open_dataset with engine='h5netcdf'

Here are some plots (not reproduced here):

- xarray.open_mfdataset with engine='netcdf4': pretty noticeable memory leak, about 0.5 MB per open_mfdataset call.
- xarray.open_mfdataset with engine='h5netcdf': looks like a small memory leak, about 0.1 MB per open_mfdataset call.
- xarray.open_dataset with engine='netcdf4' (the default): definitely has a memory leak.
- xarray.open_dataset with engine='h5netcdf': does not appear to have a memory leak.

So in conclusion, it looks like there are memory leaks in two places:

1. in the netCDF4-python backend itself (when opening files with netCDF4.Dataset)
2. in xarray.open_mfdataset
(1) looks like by far the bigger issue, which you can work around by switching to scipy or h5netcdf to read your files.
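A minimal sketch of that engine workaround, reusing the file paths from the reproduction script (the h5netcdf package must be installed for engine='h5netcdf'; engine='scipy' only reads netCDF3-format files):

```python
# Work around the netCDF4-python leak by reading with a different backend.
import glob
import xarray as xr

files = glob.glob('testfiles/*')
ds = xr.open_mfdataset(files, concat_dim='time', engine='h5netcdf')
```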
(2) is an issue for xarray. We do some caching, specifically with our backend file manager, but given that the issue only seems to appear when using open_mfdataset, I suspect it has more to do with the interaction with Dask, though to be honest I'm not exactly sure how.
Note: I modified your script to set xarray's file cache size to 1, which helps smooth out the memory usage (the relevant setting is sketched below).
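The modified script itself is not reproduced above; the file-cache change it describes corresponds to xarray's file_cache_maxsize option (default 128):

```python
# Shrink xarray's global file cache so open file handles are evicted promptly;
# the value 1 is the setting described in the comment above.
import xarray as xr

xr.set_options(file_cache_maxsize=1)
```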
I have observed a similar memleak (config see below). It occurs for both engine='netcdf4' and engine='h5netcdf'.
Example for loading a 1.2 GB netCDF file: the large allocation (about 2.6 GB) is only released with a del ds on the object; a ds.close() has no effect. There is still a "minor" memleak remaining (~4 MB) each time open_dataset is called; see the output of the memory_profiler package (not reproduced here). With open_dataset(file, engine='h5netcdf'), the minor memleak per open_dataset call is even larger (~9 MB).
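A minimal sketch of the pattern being measured, assuming a large local file named 'bigfile.nc' (a placeholder for the 1.2 GB file mentioned above) and the memory_profiler package:

```python
# Illustrates the behaviour described above: ds.close() reportedly does not
# release the large allocation, while deleting the object does.
import xarray as xr
from memory_profiler import profile

@profile
def open_and_release(path):
    ds = xr.open_dataset(path, engine='netcdf4')
    ds.load()    # read the data into memory (~2.6 GB resident)
    ds.close()   # reportedly has no effect on the large allocation
    del ds       # reportedly releases the bulk of the memory

if __name__ == '__main__':
    open_and_release('bigfile.nc')
```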
Output of xr.show_versions()

xarray: 0.12.3
pandas: 0.25.1
numpy: 1.16.4
scipy: 1.2.1
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.3.0
distributed: 2.3.2
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 41.0.1
pip: 19.2.3
conda: None
pytest: 5.0.1
IPython: 7.7.0
sphinx: None