Undefined behaviour when targeting tempfile.TemporaryFile or io.BytesIO with create_virtual_dataset
Summary:
When a virtual dataset is saved to a tempfile.TemporaryFile or io.BytesIO object using the create_virtual_dataset function, accessing the data causes undefined behaviour. So far, this has shown up either as a segmentation fault or as a dataset composed entirely of the fillvalue parameter.
The example below shows the fillvalue behaviour; a larger example with real data from the GQA dataset has produced segmentation faults.
Steps to reproduce:
Run this modified version of the official simple virtual dataset sample:
import h5py
import numpy as np
import tempfile
import io

# create some sample data
data = np.arange(0, 100).reshape(1, 100) + np.arange(1, 5).reshape(4, 1)

# Create source files (0.h5 to 3.h5)
for n in range(4):
    with h5py.File(f"{n}.h5", "w") as f:
        d = f.create_dataset("data", (100,), "i4", data[n])

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(4, 100), dtype="i4")
for n in range(4):
    filename = "{}.h5".format(n)
    vsource = h5py.VirtualSource(filename, "data", shape=(100,))
    layout[n] = vsource

# Add virtual dataset to three types of files
vds = tempfile.TemporaryFile()
bio = io.BytesIO()
with h5py.File("vds.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=-5)
with h5py.File(vds, "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=-5)
with h5py.File(bio, "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=-5)

# virtual dataset is transparent for reader!
with h5py.File("vds.h5", "r") as f:
    print("Virtual dataset:")
    print(f["vdata"][:, :10])
with h5py.File(vds, "r") as f:
    print("Virtual dataset (TemporaryFile):")
    print(f["vdata"][:, :10])
with h5py.File(bio, "r") as f:
    print("Virtual dataset (BytesIO):")
    print(f["vdata"][:, :10])
Expected Output:
Virtual dataset:
[[ 1 2 3 4 5 6 7 8 9 10]
[ 2 3 4 5 6 7 8 9 10 11]
[ 3 4 5 6 7 8 9 10 11 12]
[ 4 5 6 7 8 9 10 11 12 13]]
Virtual dataset (TemporaryFile):
[[ 1 2 3 4 5 6 7 8 9 10]
[ 2 3 4 5 6 7 8 9 10 11]
[ 3 4 5 6 7 8 9 10 11 12]
[ 4 5 6 7 8 9 10 11 12 13]]
Virtual dataset (BytesIO):
[[ 1 2 3 4 5 6 7 8 9 10]
[ 2 3 4 5 6 7 8 9 10 11]
[ 3 4 5 6 7 8 9 10 11 12]
[ 4 5 6 7 8 9 10 11 12 13]]
Actual Output:
Virtual dataset:
[[ 1 2 3 4 5 6 7 8 9 10]
[ 2 3 4 5 6 7 8 9 10 11]
[ 3 4 5 6 7 8 9 10 11 12]
[ 4 5 6 7 8 9 10 11 12 13]]
Virtual dataset (TemporaryFile):
[[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]]
Virtual dataset (BytesIO):
[[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]]
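One way to narrow this down is to check which source file names the virtual dataset actually stores, since those names are what HDF5 has to resolve when the data is read. The following is a minimal diagnostic sketch, not a fix. It uses Dataset.is_virtual and Dataset.virtual_sources() from h5py's high-level API; list_virtual_sources is a hypothetical helper, and the scratch directory and file names are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def list_virtual_sources(h5_file, dataset_name):
    """Return (source file name, source dataset name) pairs for a virtual dataset.

    Hypothetical helper: returns an empty list if the dataset is not virtual.
    """
    dset = h5_file[dataset_name]
    if not dset.is_virtual:
        return []
    return [(m.file_name, m.dset_name) for m in dset.virtual_sources()]


# Illustrative setup: one source file and one VDS file in a scratch directory.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source.h5")
with h5py.File(source_path, "w") as f:
    f.create_dataset("data", data=np.arange(10, dtype="i4"))

# Map the single source dataset onto row 0 of the virtual layout.
layout = h5py.VirtualLayout(shape=(1, 10), dtype="i4")
layout[0] = h5py.VirtualSource(source_path, "data", shape=(10,))

vds_path = os.path.join(workdir, "vds.h5")
with h5py.File(vds_path, "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=-5)

# Inspect the mapping that was written into the VDS file.
with h5py.File(vds_path, "r") as f:
    sources = list_virtual_sources(f, "vdata")

print(sources)
```

The same helper can be called on a file opened from the TemporaryFile or BytesIO object in the reproduction, to compare the stored source names against the path-based "vds.h5" case.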
System Information:
- Operating System: Arch Linux x86_64 5.7.11-arch1-1
- Python version: 3.8.5 (system via pacman)
- h5py version: 2.10.0
- HDF5 version: 1.10.4
Issue Analytics
- State:
- Created 3 years ago
- Comments: 9 (4 by maintainers)
#1805 also involves SWMR, which is an additional complication - in that case, when the source files were closed, the virtual dataset worked.
We just released 3.7 with the addition of a virtual_prefix parameter (in #2092) to set a filesystem prefix for finding VDS source files. Maybe that will help? But I don’t really know how virtual datasets interact with file drivers that may not have filesystem paths.
If you want to create a file purely in memory, you could also use the ‘core’ driver (see the file driver docs). This doesn’t let you read or manipulate the raw file data like you could with io.BytesIO(), but it’s implemented in HDF5 itself, whereas the fileobj driver is an h5py addition.
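For reference, a minimal sketch of the ‘core’ driver suggestion, assuming backing_store=False so that nothing is written to disk. The file name is only a label in that case, and the dataset name and contents are made up for illustration. Whether virtual dataset sources resolve correctly under this driver is not confirmed in this thread, so the sketch only shows an ordinary dataset.

```python
import h5py
import numpy as np

# With driver="core" the file lives in memory; backing_store=False means
# the contents are discarded on close instead of being saved to this name.
with h5py.File("in_memory.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("data", data=np.arange(10, dtype="i4"))
    # Read the data back while the in-memory file is still open.
    total = int(f["data"][:].sum())

print(total)
```

Unlike io.BytesIO, the data is gone once the file is closed, so anything needed later has to be read out inside the with block.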