Undefined behaviour when targeting tempfile.TemporaryFile or io.BytesIO with create_virtual_dataset

Summary: When a virtual dataset is saved to a tempfile.TemporaryFile or io.BytesIO object with the create_virtual_dataset function, accessing its data causes undefined behaviour. So far, this has shown up either as a segmentation fault or as a dataset composed entirely of the fillvalue.

The example below reproduces the fillvalue behaviour; a larger example using real data from the GQA dataset has produced segmentation faults.

Steps to reproduce:

Run this modified version of the official simple virtual dataset sample:

import h5py
import numpy as np
import tempfile
import io

# create some sample data
data = np.arange(0, 100).reshape(1, 100) + np.arange(1, 5).reshape(4, 1)

# Create source files (0.h5 to 3.h5)
for n in range(4):
    with h5py.File(f"{n}.h5", "w") as f:
        d = f.create_dataset("data", (100,), "i4", data[n])

# Assemble virtual dataset
layout = h5py.VirtualLayout(shape=(4, 100), dtype="i4")
for n in range(4):
    filename = "{}.h5".format(n)
    vsource = h5py.VirtualSource(filename, "data", shape=(100,))
    layout[n] = vsource

# Add virtual dataset to three types of files
vds = tempfile.TemporaryFile()
bio = io.BytesIO()

with h5py.File("vds.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=-5)
with h5py.File(vds, "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=-5)
with h5py.File(bio, "w", libver="latest") as f:
    f.create_virtual_dataset("vdata", layout, fillvalue=-5)

# virtual dataset is transparent for reader!
with h5py.File("vds.h5", "r") as f:
    print("Virtual dataset:")
    print(f["vdata"][:, :10])

with h5py.File(vds, "r") as f:
    print("Virtual dataset (TemporaryFile):")
    print(f["vdata"][:, :10])

with h5py.File(bio, "r") as f:
    print("Virtual dataset (BytesIO):")
    print(f["vdata"][:, :10])

Expected Output:

Virtual dataset:
[[ 1  2  3  4  5  6  7  8  9 10]
 [ 2  3  4  5  6  7  8  9 10 11]
 [ 3  4  5  6  7  8  9 10 11 12]
 [ 4  5  6  7  8  9 10 11 12 13]]
Virtual dataset (TemporaryFile):
[[ 1  2  3  4  5  6  7  8  9 10]
 [ 2  3  4  5  6  7  8  9 10 11]
 [ 3  4  5  6  7  8  9 10 11 12]
 [ 4  5  6  7  8  9 10 11 12 13]]
Virtual dataset (BytesIO):
[[ 1  2  3  4  5  6  7  8  9 10]
 [ 2  3  4  5  6  7  8  9 10 11]
 [ 3  4  5  6  7  8  9 10 11 12]
 [ 4  5  6  7  8  9 10 11 12 13]]

Actual Output:

Virtual dataset:
[[ 1  2  3  4  5  6  7  8  9 10]
 [ 2  3  4  5  6  7  8  9 10 11]
 [ 3  4  5  6  7  8  9 10 11 12]
 [ 4  5  6  7  8  9 10 11 12 13]]
Virtual dataset (TemporaryFile):
[[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
 [-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
 [-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
 [-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]]
Virtual dataset (BytesIO):
[[-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
 [-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
 [-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]
 [-5 -5 -5 -5 -5 -5 -5 -5 -5 -5]]

System Information:

  • Operating System: Arch Linux x86_64 5.7.11-arch1-1
  • Python version: 3.8.5 (system via pacman)
  • h5py version: 2.10.0
  • HDF5 version: 1.10.4

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

takluyver commented, May 24, 2022

#1805 also involves SWMR, which is an additional complication - in that case, when the source files were closed, the virtual dataset worked.

We just released 3.7 with the addition of a virtual_prefix parameter (in #2092) to set a filesystem prefix for finding VDS source files. Maybe that will help? But I don’t really know how virtual datasets interact with file drivers that may not have filesystem paths.
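Since it is unclear how source files are resolved when the file holding the virtual dataset has no filesystem path, one conservative workaround is to keep that file on disk inside a temporary directory. The sketch below is only an illustration under that assumption: the temporary directory, the absolute source paths, and the variable names (tmpdir, vds_path, result) are choices made for this example and do not come from the issue.

```python
import os
import tempfile

import h5py
import numpy as np

# Same sample data as in the report: 4 rows of 100 integers.
data = np.arange(0, 100).reshape(1, 100) + np.arange(1, 5).reshape(4, 1)

with tempfile.TemporaryDirectory() as tmpdir:
    layout = h5py.VirtualLayout(shape=(4, 100), dtype="i4")
    for n in range(4):
        # Source files live at real paths inside the temporary directory.
        src = os.path.join(tmpdir, f"{n}.h5")
        with h5py.File(src, "w") as f:
            f.create_dataset("data", data=data[n], dtype="i4")
        # An absolute path is stored so lookup does not depend on the cwd.
        layout[n] = h5py.VirtualSource(src, "data", shape=(100,))

    # The virtual dataset file also has a real path, unlike a file object.
    vds_path = os.path.join(tmpdir, "vds.h5")
    with h5py.File(vds_path, "w", libver="latest") as f:
        f.create_virtual_dataset("vdata", layout, fillvalue=-5)

    # Read everything back before the directory is cleaned up.
    with h5py.File(vds_path, "r") as f:
        result = f["vdata"][:]

print(result[:, :10])
```

Everything is deleted when the with block exits, which gives much of the convenience of tempfile.TemporaryFile while going through the ordinary path-based file driver, the case that works in the report.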

takluyver commented, Aug 6, 2020

If you want to create a file purely in memory, you could also use the ‘core’ driver (see the file driver docs). This doesn’t let you read or manipulate the raw file data like you could with io.BytesIO(), but it’s implemented in HDF5 itself, whereas the fileobj driver is an h5py addition.
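A minimal sketch of the ‘core’ driver suggestion follows, assuming that backing_store=False is used so nothing is written to disk when the file is closed. The file name in_memory_vds.h5 and the temporary directory are hypothetical choices for this example. Whether the source files are resolved correctly in this setup is not confirmed by the thread, so the sketch prints what it reads rather than assuming a particular result.

```python
import os
import tempfile

import h5py
import numpy as np

# Same sample data as in the report: 4 rows of 100 integers.
data = np.arange(0, 100).reshape(1, 100) + np.arange(1, 5).reshape(4, 1)

with tempfile.TemporaryDirectory() as tmpdir:
    layout = h5py.VirtualLayout(shape=(4, 100), dtype="i4")
    for n in range(4):
        # Source files are ordinary on-disk HDF5 files.
        src = os.path.join(tmpdir, f"{n}.h5")
        with h5py.File(src, "w") as f:
            f.create_dataset("data", data=data[n], dtype="i4")
        layout[n] = h5py.VirtualSource(src, "data", shape=(100,))

    # Hypothetical name; with backing_store=False the core driver keeps the
    # file in memory only and discards it on close.
    mem_name = os.path.join(tmpdir, "in_memory_vds.h5")
    with h5py.File(mem_name, "w", driver="core", backing_store=False,
                   libver="latest") as f:
        f.create_virtual_dataset("vdata", layout, fillvalue=-5)
        # The contents vanish on close, so read inside the same block.
        shape = f["vdata"].shape
        first_columns = f["vdata"][:, :10]

print(shape)
print(first_columns)
```

If the printed values are all -5, the source files were not found and the fillvalue was returned instead, which is the same symptom described in the report.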
