
Segmentation fault reading many groups from many files


This is probably the wrong place to report it, but I haven't been able to reproduce this without using xarray. Repeatedly opening NetCDF4/HDF5 files and reading a group from each triggers a segmentation fault after about 130–150 openings. See details below.

Code Sample, a copy-pastable example if possible

from itertools import count, product
import glob

import netCDF4
import xarray

files = sorted(glob.glob("/media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/*BODY*.nc"))

# Recursively collect the paths of all groups in a dataset.
def get_groups(ds, pre=""):
    for g in ds.groups.keys():
        nm = pre + "/" + g
        yield from get_groups(ds[g], nm)
        yield nm

# Determine the group structure from the first file (all files share the same layout).
with netCDF4.Dataset(files[0]) as ds:
    groups = sorted(get_groups(ds))
print("total groups", len(groups), "total files", len(files))

ds_all = []
ng = 20  # number of groups to use
nf = 20  # number of files to use
print("using groups", ng, "using files", nf)

# Open every (group, file) combination and keep the datasets alive.
for (i, (g, f)) in zip(count(), product(groups[:ng], files[:nf])):
    print("attempting", i, "group", g, "from", f)
    ds = xarray.open_dataset(f, group=g, decode_cf=False)
    ds_all.append(ds)

Problem description

I have 70 NetCDF-4 files with 70 groups each. When I cycle through the files, reading one group at a time, the next opening fails with a segmentation fault after about 130–150 openings. Reading every group from every file would require a total of 70*70=4900 openings; limiting the loop to 20 groups from 20 files requires only 400. In either case, it fails after about 130–150 openings. I'm using the Python xarray interface, but the error occurs in the HDF5 library. The message below includes the HDF5 diagnostics followed by the Python traceback:

HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 140107218855616:
  #000: H5D.c line 485 in H5Dget_create_plist(): Can't get creation plist
    major: Dataset
    minor: Can't get value
  #001: H5Dint.c line 3159 in H5D__get_create_plist(): can't get dataset's creation property list
    major: Dataset
    minor: Can't get value
  #002: H5Dint.c line 3296 in H5D_get_create_plist(): datatype conversion failed
    major: Dataset                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
    minor: Can't convert datatypes
  #003: H5T.c line 5025 in H5T_convert(): datatype conversion failed
    major: Datatype
    minor: Can't convert datatypes
  #004: H5Tconv.c line 3227 in H5T__conv_vlen(): can't read VL data
    major: Datatype
    minor: Read failed
  #005: H5Tvlen.c line 853 in H5T_vlen_disk_read(): Unable to read VL information
    major: Datatype
    minor: Read failed
  #006: H5HG.c line 611 in H5HG_read(): unable to protect global heap
    major: Heap
    minor: Unable to protect metadata
  #007: H5HG.c line 264 in H5HG__protect(): unable to protect global heap
    major: Heap
    minor: Unable to protect metadata
  #008: H5AC.c line 1591 in H5AC_protect(): unable to get logging status
    major: Object cache
    minor: Internal error detected
  #009: H5Clog.c line 313 in H5C_get_logging_status(): cache magic value incorrect
    major: Invalid arguments to routine
    minor: Bad value
HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 140107218855616:
  #000: H5L.c line 1138 in H5Literate(): link iteration failed
    major: Links
    minor: Iteration failed
  #001: H5L.c line 3440 in H5L__iterate(): link iteration failed
    major: Links
    minor: Iteration failed
  #002: H5Gint.c line 893 in H5G_iterate(): error iterating over links
    major: Symbol table
    minor: Iteration failed
  #003: H5Gobj.c line 683 in H5G__obj_iterate(): can't iterate over dense links
    major: Symbol table
    minor: Iteration failed
  #004: H5Gdense.c line 1054 in H5G__dense_iterate(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
  #005: H5Glink.c line 493 in H5G__link_iterate_table(): iteration operator failed
    major: Symbol table
    minor: Can't move to next iterator location
Traceback (most recent call last):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 167, in acquire
    file = self._cache[self._key]
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/lru_cache.py", line 41, in __getitem__
    value = self._cache[key]
KeyError: [<function _open_netcdf4_group at 0x7f6d27b0f7b8>, ('/media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410114417_GTT_DEV_20170410113908_20170410113917_N__C_0070_0065.nc', CombinedLock([<SerializableLock: 30e581d6-154c-486b-8b6a-b9a6c347f4e4>, <SerializableLock: bb132fc5-db57-499d-bc1f-661bc0025616>])), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('group', '/data/vis_04/measured'), ('persist', False))]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/mwe9.py", line 24, in <module>
    f, group=g, decode_cf=False)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/api.py", line 363, in open_dataset
    filename_or_obj, group=group, lock=lock, **backend_kwargs)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 352, in open
    return cls(manager, lock=lock, autoclose=autoclose)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 311, in __init__
    self.format = self.ds.data_model
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 356, in ds
    return self._manager.acquire().value
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 173, in acquire
    file = self._opener(*self._args, **kwargs)
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 244, in _open_netcdf4_group
    ds = nc4.Dataset(filename, mode=mode, **kwargs)
  File "netCDF4/_netCDF4.pyx", line 2291, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1855, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -101] NetCDF: HDF error: b'/media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410114417_GTT_DEV_20170410113908_20170410113917_N__C_0070_0065.nc'

More usually, however, it fails with a segmentation fault and no further information.

The failure might happen in any file.

The full output of my script might end with:

attempting 137 group /data/ir_123/measured from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113734_GTT_DEV_20170410113225_20170410113234_N__C_0070_0018.nc
attempting 138 group /data/ir_123/measured from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113742_GTT_DEV_20170410113234_20170410113242_N__C_0070_0019.nc
attempting 139 group /data/ir_123/measured from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113751_GTT_DEV_20170410113242_20170410113251_N__C_0070_0020.nc
attempting 140 group /data/ir_123/quality_channel from /media/nas/x21308/2019_05_Testdata/MTG/FCI/FDHSI/uncompressed/20170410_RC70/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY--L2P-NC4E_C_EUMT_20170410113508_GTT_DEV_20170410113000_20170410113008_N__C_0070_0001.nc
Fatal Python error: Segmentation fault

prior to the segmentation fault. When the script is run with -X faulthandler and a segmentation fault happens, the output is:

Fatal Python error: Segmentation fault

Current thread 0x00007ff6ab89d6c0 (most recent call first):
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 244 in _open_netcdf4_group
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/file_manager.py", line 173 in acquire
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 356 in ds
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 311 in __init__
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/netCDF4_.py", line 352 in open
  File "/media/nas/x21324/miniconda3/envs/py37d/lib/python3.7/site-packages/xarray/backends/api.py", line 363 in open_dataset
  File "/tmp/mwe9.py", line 24 in <module>
Segmentation fault (core dumped)
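
For reference, the same per-thread traceback dump can also be enabled from inside the script via the standard-library faulthandler module, which is equivalent to passing -X faulthandler on the command line (a generic sketch, not part of the original report):

import faulthandler

# Dump the Python traceback of every thread if a fatal signal
# such as SIGSEGV is received.
faulthandler.enable()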

Expected Output

I expect no segmentation fault.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1 | packaged by conda-forge | (default, Feb 18 2019, 01:42:00)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.12.14-lp150.12.58-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.0
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.5.0.1
pydap: None
h5netcdf: 0.7.1
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: 1.0.22
cfgrib: None
iris: None
bottleneck: None
dask: 1.1.5
distributed: 1.26.1
matplotlib: 3.0.3
cartopy: 0.17.0
seaborn: None
setuptools: 40.8.0
pip: 19.0.3
conda: None
pytest: None
IPython: 7.4.0
sphinx: 2.0.0

The machine is running openSUSE 15.0 with Linux oflws222 4.12.14-lp150.12.58-default #1 SMP Mon Apr 1 15:20:46 UTC 2019 (58fcc15) x86_64 x86_64 x86_64 GNU/Linux.

The problem has also been reported on other machines, such as one running CentOS Linux release 7.6.1810 (Core) with Linux oflks333.dwd.de 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux.

The HDF5 installation on my machine comes from the SUSE package. From the output of strings /usr/lib64/libhdf5.so, I get:

            SUMMARY OF THE HDF5 CONFIGURATION
            =================================
General Information:
-------------------
                   HDF5 Version: 1.10.1
                    Host system: x86_64-suse-linux-gnu
                       Byte sex: little-endian
             Installation point: /usr
Compiling Options:
------------------
                     Build Mode: production
              Debugging Symbols: no
                        Asserts: no
                      Profiling: no
             Optimization Level: high
Linking Options:
----------------
                      Libraries: static, shared
  Statically Linked Executables:
                        LDFLAGS:
                     H5_LDFLAGS:
                     AM_LDFLAGS:
                Extra libraries: -lpthread -lz -ldl -lm
                       Archiver: ar
                         Ranlib: ranlib
Languages:
----------
                              C: yes
                     C Compiler: /usr/bin/gcc
                       CPPFLAGS:
                    H5_CPPFLAGS: -D_GNU_SOURCE -D_POSIX_C_SOURCE=200112L   -DNDEBUG -UH5_DEBUG_API
                    AM_CPPFLAGS:
                        C Flags: -fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -g
                     H5 C Flags:   -std=c99 -pedantic -Wall -W -Wundef -Wshadow -Wpointer-arith -Wbad-function-cast -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wnested-externs -finline-functions -s -Wno-inline -Wno-aggregate-return -O
                     AM C Flags:
               Shared C Library: yes
               Static C Library: yes
                        Fortran: yes
               Fortran Compiler: /usr/bin/gfortran
                  Fortran Flags:
               H5 Fortran Flags:  -pedantic -Wall -Wextra -Wunderflow -Wimplicit-interface -Wsurprising -Wno-c-binding-type  -s -O2
               AM Fortran Flags:
         Shared Fortran Library: yes
         Static Fortran Library: yes
                            C++: yes
                   C++ Compiler: /usr/bin/g++
                      C++ Flags: -fmessage-length=0 -grecord-gcc-switches -O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -g
                   H5 C++ Flags:   -pedantic -Wall -W -Wundef -Wshadow -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wconversion -Wredundant-decls -Winline -Wsign-promo -Woverloaded-virtual -Wold-style-cast -Weffc++ -Wreorder -Wnon-virtual-dtor -Wctor-dtor-privacy -Wabi -finline-functions -s -O
                   AM C++ Flags:
             Shared C++ Library: yes
             Static C++ Library: yes
                           Java: no
Features:
---------
                  Parallel HDF5: no
             High-level library: yes
                   Threadsafety: yes
            Default API mapping: v110
 With deprecated public symbols: yes
         I/O filters (external): deflate(zlib)
                            MPE: no
                     Direct VFD: no
                        dmalloc: no
 Packages w/ extra debug output: none
                    API tracing: no
           Using memory checker: no
Memory allocation sanity checks: no
            Metadata trace file: no
         Function stack tracing: no
      Strict file format checks: no
   Optimization instrumentation: no


Top GitHub Comments

gerritholl commented on May 13, 2019

In our code, this problem gets triggered by xarray's lazy handling. If we have

with xr.open_dataset('file.nc') as ds:
    val = ds["field"]
return val

then when a caller tries to use val, xarray reopens the dataset and does not close it again. This makes the context manager effectively useless: we use it to close the file as soon as we have accessed the value, but the file gets opened again later anyway, against the intention of the code.

We can avoid this by calling val.load() from within the context manager, as the linked satpy PR above does (a minimal sketch follows at the end of this comment). But what is the intention of xarray's design here? Should lazy reading close the file after opening and reading the value? I would say it probably should do something like

if file_was_not_open:
    open file
    get value
    close file # this step currently omitted
    return value
else:
    get value
    return value

Is not closing the file after it has been reopened to retrieve a "lazy" value by design, or might this be considered a wart/bug?
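
For completeness, a minimal sketch of the val.load() workaround described above; the file name and variable name are placeholders, not taken from the issue:

import xarray as xr

def read_field(path, name):
    # Load the values eagerly while the file is still open, so that
    # xarray does not need to reopen the file lazily afterwards.
    with xr.open_dataset(path) as ds:
        return ds[name].load()

# Usage (hypothetical): field = read_field("file.nc", "field")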

gerritholl commented on Jul 8, 2019

And I can confirm that the problem I reported originally on May 10 is also gone with #3082.
