
Segmentation fault writing VLEN Dataset of bytes dataset

See original GitHub issue

Issue: I’m getting the old and dreaded 😨 “Segmentation fault (core dumped)”. When? While dumping a Dataset of variable length with raw binary data.

Details

  • I need to use an adapter (Google Protobuf) to load the binary data, so I haven’t tried encoding to utf-8 or ascii.

Thanks!

To assist reproducing bugs, please include the following:

  • Operating System: Ubuntu 16.04.2 LTS
  • Python version: 3.8.5
  • Where Python was acquired: miniconda3.
  • h5py version: 3.1.0
  • HDF5 version: 1.12.0
  • The full traceback/stack trace shown (if it appears)
Summary of the h5py configuration
---------------------------------

h5py    3.1.0
HDF5    1.12.0
Python  3.8.5 | packaged by conda-forge | (default, Sep 24 2020, 16:55:52) 
[GCC 7.5.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.19.1
cython (built with) 0.29.21
numpy (built against) 1.17.5
HDF5 (built against) 1.12.0

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

1 reaction
takluyver commented, Feb 10, 2021

The specific segfault is a bug - probably in HDF5 itself, though I can’t be certain of that without reproducing it in C code.

The bigger context is that HDF5 isn’t that well suited to storing arbitrary blobs of binary data. It has the opaque data type, which h5py maps to numpy void dtypes, but by itself, this is a fixed size container, so you declare up front that every entry has 12 bytes (for instance). You should be able to get round that by making a vlen array dtype of 1-byte opaque values, i.e. f.create_dataset(... , dtype=h5py.vlen_dtype(np.dtype('V1'))). Try that, but be aware it’s not the main use case for HDF5.
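
For concreteness, a minimal sketch of the vlen-of-opaque layout suggested above (the file name and sample blobs are made up, and the follow-up comment below reports a segfault with a similar pattern, so treat this as illustrative rather than a confirmed fix):

import h5py
import numpy as np

# Stand-in blobs; in the issue they would be protobuf SerializeToString() outputs.
messages = [b'\x01\x02\x03', b'\xde\xad\xbe\xef\x00']

with h5py.File('vlen_opaque.h5', 'w') as f:
    dset = f.create_dataset(
        'data', (len(messages),),
        dtype=h5py.vlen_dtype(np.dtype('V1')),  # variable-length array of 1-byte opaque values
    )
    for i, msg in enumerate(messages):
        # Each element is written as a 1-D array of single-byte voids.
        dset[i] = np.frombuffer(msg, dtype='V1')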

Storing each entry in its own dataset or attribute, as you’ve done, works because each dataset/attribute has its own dtype, so you can use a different fixed-size opaque type for each. But that means storing them by name instead of number, which is more awkward, and can’t take advantage of HDF5’s chunking and compression.
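
The per-entry variant described here looks roughly like the following sketch (file and group names are made up); assigning np.void(blob) to a group member creates a scalar dataset with a fixed-size opaque type sized to that particular blob:

import h5py
import numpy as np

blobs = [b'\x01\x02\x03', b'\xff' * 12]  # stand-ins for serialized protobuf messages

with h5py.File('per_entry.h5', 'w') as f:
    grp = f.create_group('detections')
    for i, blob in enumerate(blobs):
        # One scalar opaque dataset per entry, addressed by name rather than index.
        grp[str(i)] = np.void(blob)

with h5py.File('per_entry.h5', 'r') as f:
    assert f['detections/0'][()].tobytes() == blobs[0]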

What else could you do?

  • Google suggests making a simple custom format with each protobuf message preceded by its length in bytes, so you know how much data to parse when reading it (a small sketch of this framing appears after this comment).
    • This could be gzipped if you want to make the file smaller
  • Or frame it in something like msgpack, which does basically the same thing for you.
  • If random access is important (ā€˜get message 1234’), an sqlite database with a BLOB column might work.
  • Pickle might be OK if you know you will always be reading data that you trust, but it’s not safe to unpickle data which someone could have created maliciously.

That’s all assuming that you have to store protobuf messages. If protobuf is Y in an XY problem, there are many more options.
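
A minimal sketch of the length-prefixed framing from the first bullet, assuming a 4-byte little-endian length header and gzip for the optional compression (the file name and sample messages are made up):

import gzip
import struct

def write_messages(path, messages):
    # Each message is preceded by its length as a 4-byte little-endian unsigned int.
    with gzip.open(path, 'wb') as f:
        for msg in messages:
            f.write(struct.pack('<I', len(msg)))
            f.write(msg)

def read_messages(path):
    with gzip.open(path, 'rb') as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (length,) = struct.unpack('<I', header)
            yield f.read(length)

# Usage: in this setting the messages would be protobuf SerializeToString() outputs.
msgs = [b'\x01\x02', b'hello world']
write_messages('frames.bin.gz', msgs)
assert list(read_messages('frames.bin.gz')) == msgs

Note this gives sequential access only; the sqlite BLOB option below is the one to reach for if random access matters.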

0 reactions
escorciav commented, Feb 22, 2021

@takluyver I tried using np.dtype('V1'), but I still get a segmentation fault 😓. The code snippet is below.

Must all the strings have the same length? Do I need to update the code for ragged strings?

If you feel like trying it on your end, I’m using the PKL file with protobuf data from this repo. Happy to share the first part of the code that loads the data into memory 😄.

import h5py
import numpy as np

# video_id (str) and data (list of detections) come from the loading code mentioned above.
fid = h5py.File('foo.h5', 'w')
gid = fid.create_group(video_id)
dset = gid.create_dataset(
    'data', (len(data),),
    dtype=h5py.vlen_dtype(np.dtype('V1'))
)
for i, detection_i in enumerate(data):
    binary_blob = detection_i.to_protobuf().SerializeToString()
    # Write the serialized blob as an opaque scalar into element i of the vlen dataset.
    dset[i] = np.void(binary_blob)
    if i == 2:
        break  # only dump the first few entries while testing
print('Dumped')
fid.close()
Read more comments on GitHub >

