Segmentation fault reading many groups from many files
This is probably the wrong place to report it, but I haven't been able to reproduce this without using xarray. Repeatedly opening NetCDF4/HDF5 files and reading a group from them triggers a segmentation fault after about 130–150 openings. See details below.
Code Sample, a copy-pastable example if possible
```python
from itertools import product
import glob

import netCDF4
import xarray

files = sorted(glob.glob(
    "/media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/*BODY*.nc"))

# recursively collect the full paths of all groups
def get_groups(ds, pre=""):
    for g in ds.groups.keys():
        nm = pre + "/" + g
        yield from get_groups(ds[g], nm)
        yield nm

with netCDF4.Dataset(files[0]) as ds:
    groups = sorted(get_groups(ds))

print("total groups", len(groups), "total files", len(files))

ds_all = []
ng = 20
nf = 20
print("using groups", ng, "using files", nf)
for i, (g, f) in enumerate(product(groups[:ng], files[:nf])):
    print("attempting", i, "group", g, "from", f)
    ds = xarray.open_dataset(f, group=g, decode_cf=False)
    ds_all.append(ds)
```
Problem description
I have 70 NetCDF-4 files with 70 groups each. When I cycle through the files and read one group from each at a time, after about 130–150 openings the next one fails with a segmentation fault. Reading every group from every file would require 70 × 70 = 4900 openings; limiting to 20 groups from 20 files requires 400. In either case it fails after about 130–150 openings. I'm using the Python xarray interface, but the error occurs in the HDF5 library. The message below includes the traceback in Python:
```
HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 140107218855616:
  #000: H5D.c line 485 in H5Dget_create_plist(): Can't get creation plist
    major: Dataset
    minor: Can't get value
  #001: H5Dint.c line 3159 in H5D__get_create_plist(): can't get dataset's creation property list
    major: Dataset
    minor: Can't get value
  #002: H5Dint.c line 3296 in H5D_get_create_plist(): datatype conversion failed
    major: Dataset
    minor: Can't convert datatypes
  #003: H5T.c line 5025 in H5T_convert(): datatype conversion failed
    major: Datatype
    minor: Can't convert datatypes
  #004: H5Tconv.c line 3227 in H5T__conv_vlen(): can't read VL data
    major: Datatype
    minor: Read failed
  #005: H5Tvlen.c line 853 in H5T_vlen_disk_read(): Unable to read VL information
    major: Datatype
    minor: Read failed
  #006: H5HG.c line 611 in H5HG_read(): unable to protect global heap
    major: Heap
    minor: Unable to protect metadata
  #007: H5HG.c line 264 in H5HG__protect(): unable to protect global heap
    major: Heap
    minor: Unable to protect metadata
  #008: H5AC.c line 1591 in H5AC_protect(): unable to get logging status
    major: Object cache
    minor: Internal error detected
  #009: H5Clog.c line 313 in H5C_get_logging_status(): cache magic value incorrect
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 140107218855616:
  #000: H5L.c line 1138 in H5Literate(): link iteration failed
    major: Links
    minor: Iteration failed
  #001: H5L.c line 3440 in H5L__iterate(): link iteration failed
    major: Links
    minor: Iteration failed
  #002: H5Gint.c line 893 in H5G_iterate(): error iterating over links
    major: Symbol table
    minor: Iteration failed
  #003: H5Gobj.c line 683 in H5G__obj_iterate(): can't iterate over dense links
    major: Symbol table
    minor: Iteration failed
  #004: H5Gdense.c line 1054 in H5G__dense_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #005: H5Glink.c line 493 in H5G__link_iterate_table(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
```
```
Traceback (most recent call last):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 167, in acquire
    file = self._cache[self._key]
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/lru_cache.py", line 41, in __getitem__
    value = self._cache[key]
KeyError: [<function _open_netcdf4_group at 0x7f6d27b0f7b8>, ('/media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410114417_GTT_DEV_20170410113908_20170410113917_N__C_0070_0065.nc', CombinedLock([<SerializableLock: 30e581d6-154c-486b-8b6a-b9a6c347f4e4>, <SerializableLock: bb132fc5-db57-499d-bc1f-661bc0025616>])), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('group', '/data/vis_04/measured'), ('persist', False))]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/mwe9.py", line 24, in <module>
    f, group=g, decode_cf=False)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/api.py", line 363, in open_dataset
    filename_or_obj, group=group, lock=lock, **backend_kwargs)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 352, in open
    return cls(manager, lock=lock, autoclose=autoclose)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 311, in __init__
    self.format = self.ds.data_model
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 356, in ds
    return self._manager.acquire().value
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 173, in acquire
    file = self._opener(*self._args, **kwargs)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 244, in _open_netcdf4_group
    ds = nc4.Dataset(filename, mode=mode, **kwargs)
  File "netCDF4/_netCDF4.pyx", line 2291, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1855, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'/media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410114417_GTT_DEV_20170410113908_20170410113917_N__C_0070_0065.nc'
```
More often, however, it fails with a segmentation fault and no further information. The failure can happen in any of the files. The full output of my script might end with:
```
attempting 137 group /data/ir_123/measured from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113734_GTT_DEV_20170410113225_20170410113234_N__C_0070_0018.nc
attempting 138 group /data/ir_123/measured from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113742_GTT_DEV_20170410113234_20170410113242_N__C_0070_0019.nc
attempting 139 group /data/ir_123/measured from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113751_GTT_DEV_20170410113242_20170410113251_N__C_0070_0020.nc
attempting 140 group /data/ir_123/quality_channel from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113508_GTT_DEV_20170410113000_20170410113008_N__C_0070_0001.nc
Fatal Python error: Segmentation fault
```
When running with `python -X faulthandler`, the fault handler reports:
```
Fatal Python error: Segmentation fault

Current thread 0x00007ff6ab89d6c0 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 244 in _open_netcdf4_group
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 173 in acquire
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 356 in ds
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 311 in __init__
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 352 in open
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/api.py", line 363 in open_dataset
  File "/tmp/mwe9.py", line 24 in <module>
Segmentation fault (core dumped)
```
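The traceback passes through xarray's `CachingFileManager` and its LRU cache of open file handles (`file_manager.py` / `lru_cache.py`). To illustrate the caching idea at play, here is a minimal pure-Python sketch: a hypothetical `LRUFileCache` (not xarray's actual class) that evicts and closes the least-recently-used handle, so the number of simultaneously open files stays bounded no matter how many files are cycled through:

```python
import os
import tempfile
from collections import OrderedDict

class LRUFileCache:
    """Toy LRU cache of open file handles (hypothetical; not xarray's class)."""

    def __init__(self, maxsize=4):
        self.maxsize = maxsize
        self._cache = OrderedDict()  # path -> open file object

    def acquire(self, path):
        if path in self._cache:
            self._cache.move_to_end(path)  # mark as most recently used
        else:
            if len(self._cache) >= self.maxsize:
                # evict and close the least-recently-used handle
                _, oldest = self._cache.popitem(last=False)
                oldest.close()
            self._cache[path] = open(path, "rb")
        return self._cache[path]

    def open_handles(self):
        return sum(1 for fh in self._cache.values() if not fh.closed)

    def close_all(self):
        for fh in self._cache.values():
            fh.close()
        self._cache.clear()

# Create ten small scratch files and cycle through all of them.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(10):
    p = os.path.join(tmpdir, "file%d.dat" % i)
    with open(p, "wb") as fh:
        fh.write(b"\x00")
    paths.append(p)

cache = LRUFileCache(maxsize=4)
for p in paths:
    cache.acquire(p).read(1)

open_count = cache.open_handles()
print(open_count)  # prints 4: never more than maxsize handles open at once
cache.close_all()
```

If instead every acquired handle were kept open (as the reproduction script above effectively does by accumulating datasets in `ds_all`), the handle count would grow without bound, which is consistent with the crash appearing only after a large number of openings.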
Expected Output
I expect no segmentation fault.
Output of `xr.show_versions()`
```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1 | packaged by conda-forge | (default, Feb 18 2019, 01:42:00)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.58-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.0
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.5.0.1
pydap: None
h5netcdf: 0.7.1
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: 1.0.22
cfgrib: None
iris: None
bottleneck: None
dask: 1.1.5
distributed: 1.26.1
matplotlib: 3.0.3
cartopy: 0.17.0
seaborn: None
setuptools: 40.8.0
pip: 19.0.3
conda: None
pytest: None
IPython: 7.4.0
sphinx: 2.0.0
```
The machine is running openSUSE 15.0 with Linux oflws222 4.12.14-lp150.12.58-default #1 SMP Mon Apr 1 15:20:46 UTC 2019 (58fcc15) x86_64 x86_64 x86_64 GNU/Linux.
The problem has also been reported on other machines, such as one running CentOS Linux release 7.6.1810 (Core) with Linux oflks333.dwd.de 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux.
The HDF5 installation on my machine is from the SuSE package. From `strings /usr/lib64/libhdf5.so`, I get:
```
SUMMARY OF THE HDF5 CONFIGURATION
=================================

General Information:
-------------------
        HDF5 Version: 1.10.1
        Host system: x86_64-suse-linux-gnu
        Byte sex: little-endian
        Installation point: /usr

Compiling Options:
------------------
        Build Mode: production
        Debugging Symbols: no
        Asserts: no
        Profiling: no
        Optimization Level: high

Linking Options:
----------------
        Libraries: static, shared
        Statically Linked Executables:
        LDFLAGS:
        H5_LDFLAGS:
        AM_LDFLAGS:
        Extra libraries: -lpthread -lz -ldl -lm
        Archiver: ar
        Ranlib: ranlib

Languages:
----------
        C: yes
        C Compiler: /usr/bin/gcc
        CPPFLAGS:
        H5_CPPFLAGS: -D_GNU_SOURCE -D_POSIX_C_SOURCE=200112L -DNDEBUG -UH5_DEBUG_API
        AM_CPPFLAGS:
        C Flags: -fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -g
        H5 C Flags: -std=c99 -pedantic -Wall -W -Wundef -Wshadow -Wpointer-arith -Wbad-function-cast -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wnested-externs -finline-functions -s -Wno-inline -Wno-aggregate-return -O
        AM C Flags:
        Shared C Library: yes
        Static C Library: yes

        Fortran: yes
        Fortran Compiler: /usr/bin/gfortran
        Fortran Flags:
        H5 Fortran Flags: -pedantic -Wall -Wextra -Wunderflow -Wimplicit-interface -Wsurprising -Wno-c-binding-type -s -O2
        AM Fortran Flags:
        Shared Fortran Library: yes
        Static Fortran Library: yes

        C++: yes
        C++ Compiler: /usr/bin/g++
        C++ Flags: -fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -g
        H5 C++ Flags: -pedantic -Wall -W -Wundef -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Wredundant-decls -Winline -Wsign-promo -Woverloaded-virtual -Wold-style-cast -Weffc++ -Wreorder -Wnon-virtual-dtor -Wctor-dtor-privacy -Wabi -finline-functions -s -O
        AM C++ Flags:
        Shared C++ Library: yes
        Static C++ Library: yes

        Java: no

Features:
---------
        Parallel HDF5: no
        High-level library: yes
        Threadsafety: yes
        Default API mapping: v110
        With deprecated public symbols: yes
        I/O filters (external): deflate(zlib)
        MPE: no
        Direct VFD: no
        dmalloc: no
        Packages w/ extra debug output: none
        API tracing: no
        Using memory checker: no
        Memory allocation sanity checks: no
        Metadata trace file: no
        Function stack tracing: no
        Strict file format checks: no
        Optimization instrumentation: no
```
Issue Analytics
- Created 4 years ago
- Comments: 14 (14 by maintainers)
Top GitHub Comments
In our code, this problem gets triggered by xarray's lazy handling. If we open a dataset in a context manager and take a variable from it, then when a caller later tries to use that value, xarray reopens the dataset and does not close it again. This makes the context manager effectively useless: we use it to close the file as soon as we have accessed the value, but the file gets opened again later anyway, against the intention of the code. We can avoid this by calling `val.load()` from within the context manager, as the linked satpy PR above does. But what is the intention of xarray's design here? Is not closing the file after it has been reopened to retrieve a lazy value by design, or might this be considered a wart/bug?
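To make the pattern concrete without depending on xarray, here is a pure-Python sketch with hypothetical `Dataset` and `LazyValue` stand-ins (not xarray's actual classes): accessing a lazy value after the context manager exits would force a reopen of the file, whereas calling `.load()` inside the `with` block pins the data in memory so no reopen is needed:

```python
import os
import tempfile

class LazyValue:
    """Stand-in for a lazily-indexed array (hypothetical)."""
    def __init__(self, path):
        self.path = path
        self._data = None

    def load(self):
        with open(self.path, "rb") as fh:  # read eagerly, then close the file
            self._data = fh.read()
        return self

    def values(self):
        if self._data is None:  # lazy path: must reopen the file on access
            with open(self.path, "rb") as fh:
                return fh.read()
        return self._data       # eager path: data already in memory

class Dataset:
    """Stand-in for an xarray.Dataset opened from a file (hypothetical)."""
    def __init__(self, path):
        self.path = path
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        pass  # the real backend would close the netCDF4/HDF5 handle here
    def __getitem__(self, name):
        return LazyValue(self.path)

# Write a scratch file and demonstrate the eager-load pattern.
fd, path = tempfile.mkstemp()
os.write(fd, b"payload")
os.close(fd)

with Dataset(path) as ds:
    val = ds["name"].load()  # load inside the context manager

os.remove(path)              # file is gone; the loaded data is still usable
print(val.values())          # prints b'payload' -- no reopen needed
```

Without the `.load()` call, `val.values()` after the `with` block would have to reopen the (now deleted) file, which mirrors how lazy xarray values reopen the dataset behind the caller's back.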
And I can confirm that the problem I reported originally on May 10 is also gone with #3082.