
open_mfdataset memory leak, very simple case. v0.12


MCVE Code Sample

import glob
import os

import numpy as np
import xarray as xr
from memory_profiler import profile

def CreateTestFiles():
    # create a bunch of small single-timestep netCDF files
    xlen = int(1e2)
    ylen = int(1e2)
    xdim = np.arange(xlen)
    ydim = np.arange(ylen)

    os.makedirs('testfiles', exist_ok=True)
    nfiles = 100
    for i in range(nfiles):
        data = np.random.rand(xlen, ylen, 1)
        # the time coordinate must be a length-1 sequence, not a scalar
        datafile = xr.DataArray(data, coords=[xdim, ydim, [i]],
                                dims=['x', 'y', 'time'])
        datafile.to_netcdf('testfiles/datafile_{}.nc'.format(i))

@profile
def ReadFiles():
    xr.open_mfdataset(glob.glob('testfiles/*'), concat_dim='time')

if __name__ == '__main__':
    # write out files for testing
    CreateTestFiles()

    # loop through the file-read step to expose the leak
    for i in range(100):
        ReadFiles()

usage:

mprof run simplest_case.py
mprof plot

(mprof is the command-line tool from the memory_profiler Python package: mprof run records memory usage over time, and mprof plot graphs the recording.)

Problem Description

dask 1.1.4, xarray 0.12, Python 3.7.3

There appears to be a persistent memory leak in open_mfdataset. I'm writing a model calibration script that runs for ~1000 iterations, opening and closing the same set of files (the dimensions stay the same, but the data differs) on each iteration, and I eventually run out of memory because of the leak. The simple case above reproduces the same behavior. Closing the files with .close() does not fix the problem.

Is there a workaround for this? I've perused some of the existing issues but cannot tell whether this has been resolved.
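
For reference, a minimal sketch of the close-after-use pattern I tried (xarray Dataset objects support the context-manager protocol, so the with-block is equivalent to calling .close() in a finally clause); memory still grows across iterations:

import glob

import xarray as xr

def read_and_close():
    # the with-block guarantees the underlying files are closed on exit,
    # yet memory usage still climbs with every call
    with xr.open_mfdataset(glob.glob('testfiles/*'), concat_dim='time') as ds:
        pass  # coords/metadata are read eagerly; the data itself stays lazy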

[Figure 1: mprof plot of the script above, showing memory usage climbing steadily across the 100 ReadFiles iterations]

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.17.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.0
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: 1.5.5
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.1.4
distributed: 1.26.0
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: None
setuptools: 41.0.1
pip: 19.1.1
conda: None
pytest: None
IPython: 7.3.0
sphinx: None

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

shoyer commented, Aug 10, 2019 (2 reactions)

Thanks for the profiling script. I ran a few permutations of this:

  • xarray.open_mfdataset with engine='netcdf4' (default)
  • xarray.open_mfdataset with engine='h5netcdf'
  • xarray.open_dataset with engine='netcdf4' (default)
  • xarray.open_dataset with engine='h5netcdf'

Here are some plots:

xarray.open_mfdataset with engine='netcdf4': pretty noticeable memory leak, about 0.5 MB per open_mfdataset call [plot: netcdf4-maxsize1]

xarray.open_mfdataset with engine='h5netcdf': looks like a small memory leak, about 0.1 MB per open_mfdataset call [plot: h5netcdf-maxsize1]

xarray.open_dataset with engine='netcdf4' (default): definitely has a memory leak [plot: open_dataset - netcdf4]

xarray.open_dataset with engine='h5netcdf': does not appear to have a memory leak [plot: open_dataset - h5netcdf]

So in conclusion, it looks like there are memory leaks:

  1. when using netCDF4-Python (I was also able to confirm these without using xarray at all, just using netCDF4.Dataset)
  2. when using xarray.open_mfdataset

(1) looks like by far the bigger issue, which you can work around by switching to scipy or h5netcdf to read your files.
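
For example, a minimal sketch of the engine switch (assuming h5netcdf is installed; note that the scipy engine only reads netCDF3-format files, so h5netcdf is the likelier drop-in for netCDF4/HDF5 files):

import glob

import xarray as xr

# read through h5py/h5netcdf instead of netCDF4-python to sidestep leak (1)
ds = xr.open_mfdataset(glob.glob('testfiles/*'), engine='h5netcdf',
                       concat_dim='time')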

(2) is an issue for xarray. We do some caching of our own, specifically with our backend file manager, but given that the issue only seems to appear when using open_mfdataset, I suspect it has more to do with the interaction with Dask, though to be honest I'm not exactly sure how.

Note: I modified your script to set xarray's file cache size to 1, which helps smooth out the memory usage:

import glob

import numpy as np
import xarray as xr
from memory_profiler import profile

def CreateTestFiles():
    # create a bunch of small single-timestep netCDF files
    xlen = int(1e2)
    ylen = int(1e2)
    xdim = np.arange(xlen)
    ydim = np.arange(ylen)

    nfiles = 100
    for i in range(nfiles):
        data = np.random.rand(xlen, ylen, 1)
        datafile = xr.DataArray(data, coords=[xdim, ydim, [i]],
                                dims=['x', 'y', 'time'])
        datafile.to_netcdf('testfile_{}.nc'.format(i))

@profile
def ReadFiles():
    # permutation using open_dataset instead of open_mfdataset:
    # for i in range(100):
    #     ds = xr.open_dataset('testfile_{}.nc'.format(i), engine='netcdf4')
    #     ds.close()
    ds = xr.open_mfdataset(glob.glob('testfile_*'), engine='h5netcdf',
                           concat_dim='time')
    ds.close()

if __name__ == '__main__':
    # write out files for testing
    CreateTestFiles()

    # keep at most one file open in xarray's file cache
    xr.set_options(file_cache_maxsize=1)

    # loop through the file-read step
    for i in range(100):
        ReadFiles()
floschl commented, Sep 12, 2019 (0 reactions)

I have observed a similar memory leak (configuration below). It occurs with both engine='netcdf4' and engine='h5netcdf'.

Example for loading a 1.2 GB netCDF file: the large allocation (2.6 GB) is only released by a del ds on the object; ds.close() has no effect on it. A "minor" memory leak (~4 MB) still remains each time open_dataset is called. See the output from the memory_profiler package below; a short sketch of the del-based release pattern follows after the list.

Line #    Mem usage    Increment   Line Contents
================================================
    31    168.9 MiB    168.9 MiB   @profile
    32                             def load_and_unload_ds():
    33    173.0 MiB      4.2 MiB       ds = xr.open_dataset(LFS_DATA_DIR + '/dist2coast_1deg_merged.nc')
    34   2645.4 MiB   2472.4 MiB       ds.load()
    35   2645.4 MiB      0.0 MiB       ds.close()
    36    173.5 MiB      0.0 MiB       del ds

  • There is no difference using open_dataset(file, engine='h5netcdf'); the minor memory leak is even larger (~9 MB).
  • The memory leak persists if an additional chunks parameter is passed to open_dataset.
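
To make the pattern concrete, a minimal sketch of the del-based release described above (the explicit gc.collect() call is my own addition and may be redundant; the path is the file from the profile above):

import gc

import xarray as xr

def load_and_release(path):
    ds = xr.open_dataset(path)  # ~4 MB of the leak appears here regardless
    ds.load()                   # pulls the full arrays into memory (~2.6 GB for this file)
    ds.close()                  # releases file handles, but not the large allocation
    del ds                      # dropping the last reference is what frees the ~2.6 GB
    gc.collect()                # assumption: an explicit collection pass, possibly unnecessary

load_and_release('dist2coast_1deg_merged.nc')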

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-62-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.6.2

xarray: 0.12.3
pandas: 0.25.1
numpy: 1.16.4
scipy: 1.2.1
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.3.0
distributed: 2.3.2
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 41.0.1
pip: 19.2.3
conda: None
pytest: 5.0.1
IPython: 7.7.0
sphinx: None
