Segmentation fault writing a VLEN dataset of bytes
Issue: I'm getting the old "Segmentation fault (core dumped)". When? Dumping a dataset of variable length with raw binary data. A sketch of the kind of write involved is below.
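The exact code isn't shown here, so the following is a hypothetical reconstruction of the failing pattern, assuming a variable-length bytes dataset and raw binary payloads (file and dataset names are placeholders):

```python
import h5py

# Hypothetical stand-ins for the serialized protobuf messages; the
# real data is arbitrary binary, not valid utf-8 or ascii.
payloads = [b'\x08\x96\x01', b'\x12\x07test\x00\xff']

with h5py.File('messages.h5', 'w') as f:
    # Variable-length byte strings. HDF5 implements these as
    # NUL-terminated C strings, so raw binary is a poor fit for them.
    dt = h5py.vlen_dtype(bytes)
    ds = f.create_dataset('msgs', shape=(len(payloads),), dtype=dt)
    for i, p in enumerate(payloads):
        ds[i] = p  # a write of this shape is what reportedly crashed
```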
Details
- I need to use an adapter (Google Protobuf) to load the binary data, so I haven't tried encoding to `utf-8` or `ascii`.
Thanks!
To assist reproducing bugs, please include the following:
- Operating System: Ubuntu 16.04.2 LTS
- Python version: 3.8.5
- Where Python was acquired: miniconda3.
- h5py version: 3.1.0
- HDF5 version: 1.12.0
- Full traceback/stack trace: none beyond "Segmentation fault (core dumped)"
Summary of the h5py configuration
---------------------------------
h5py 3.1.0
HDF5 1.12.0
Python 3.8.5 | packaged by conda-forge | (default, Sep 24 2020, 16:55:52)
[GCC 7.5.0]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.19.1
cython (built with) 0.29.21
numpy (built against) 1.17.5
HDF5 (built against) 1.12.0
The specific segfault is a bug - probably in HDF5 itself, though I can't be certain of that without reproducing it in C code.
The bigger context is that HDF5 isn't that well suited to storing arbitrary blobs of binary data. It has the opaque data type, which h5py maps to numpy void dtypes, but by itself this is a fixed-size container: you declare up front that every entry has 12 bytes (for instance). You should be able to get around that by making a vlen array dtype of 1-byte opaque values, i.e. `f.create_dataset(..., dtype=h5py.vlen_dtype(np.dtype('V1')))`. Try that, but be aware it's not the main use case for HDF5.

Storing each entry in its own dataset or attribute, as you've done, works because each dataset/attribute has its own dtype, so you can use a different fixed-size opaque type for each. But that means storing them by name instead of number, which is more awkward, and can't take advantage of HDF5's chunking and compression. Both approaches are sketched below.
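A minimal sketch of both approaches, assuming the blobs are already in memory (file, group, and dataset names are illustrative, not from the issue):

```python
import h5py
import numpy as np

msgs = [b'\x01\x02\x03', b'\xff\x00\xfe\x00\xfd']  # blobs of different lengths

with h5py.File('blobs.h5', 'w') as f:
    # Approach 1: one dataset of variable-length arrays of 1-byte
    # opaque values; each entry can hold a different number of bytes.
    dt = h5py.vlen_dtype(np.dtype('V1'))
    ds = f.create_dataset('vlen_opaque', shape=(len(msgs),), dtype=dt)
    for i, m in enumerate(msgs):
        ds[i] = np.frombuffer(m, dtype='V1')

    # Approach 2: one scalar dataset per message, each with its own
    # fixed-size opaque dtype (the per-name scheme described above).
    grp = f.create_group('per_message')
    for i, m in enumerate(msgs):
        grp.create_dataset(str(i), data=np.void(m))

# Reading back from the vlen form: join the V1 bytes again.
with h5py.File('blobs.h5', 'r') as f:
    restored = [f['vlen_opaque'][i].tobytes() for i in range(len(msgs))]
```

The vlen form keeps everything in one numerically indexed dataset; the per-message form trades that for a dtype tailored to each entry.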
What else could you do?
That's all assuming that you have to store protobuf messages. If protobuf is Y in an XY problem, there are many more options; one example is sketched below.
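As one illustration, if the decoded fields are what actually matter, they can be stored as ordinary typed datasets, where chunking and compression do apply. Everything below (field names, file name) is hypothetical:

```python
import h5py
import numpy as np

# Hypothetical decoded messages; in practice these would come from
# calling protobuf's ParseFromString on each blob.
records = [
    {'timestamp': 1.5, 'value': 42},
    {'timestamp': 2.5, 'value': 7},
]

with h5py.File('fields.h5', 'w') as f:
    f.create_dataset('timestamp',
                     data=np.array([r['timestamp'] for r in records]))
    f.create_dataset('value',
                     data=np.array([r['value'] for r in records]),
                     compression='gzip')  # compression now applies
```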
@takluyver I tried using `np.dtype('V1')`, but I still get a segmentation fault. The code snippet is below. Must all the strings have the same length? Do I need to update the code for ragged strings?
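(The original snippet was not preserved; a hypothetical reconstruction of the attempt, with placeholder payloads standing in for the real protobuf messages, would be:)

```python
import h5py
import numpy as np

# Placeholder for the protobuf-serialized messages loaded elsewhere.
data = [b'\x0a\x03abc', b'\x0a\x05hello\x00']

dt = h5py.vlen_dtype(np.dtype('V1'))
with h5py.File('out.h5', 'w') as f:
    ds = f.create_dataset('msgs', shape=(len(data),), dtype=dt)
    for i, msg in enumerate(data):
        # Each element is a 1-D array of V1, so lengths can differ
        # between entries; ragged data should not need padding.
        ds[i] = np.frombuffer(msg, dtype='V1')
```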
If you feel like trying on your end, I'm using the PKL with protobuf data of this repo. Happy to share the first part of the code that loads `data` in memory.