SWMR cannot be turned off once set to True; Allow forceful switching off

Operating System: import h5py; print(h5py.version.info)
Python version: 3.7.5 (default, Oct 25 2019, 15:51:11)
Where Python was acquired: Miniconda3
h5py version: 2.10.0
HDF5 version: 1.10.5
numpy: 1.17.3

When using SWMR, are you limited to creating datasets only at file creation time? Or are you supposed to be able to turn off SWMR when the file is not being accessed by any other process?

Turning off SWMR is not possible (Jupyter Notebook, kernel restarted):

import h5py
import numpy as np

arr = np.array([.4, -.1, -.5, 8])
h5 = h5py.File("swmr_test.h5", 'w', libver='latest')
h5["np"] = arr
h5.swmr_mode = True
h5.swmr_mode = False  # raises the ValueError shown below

ValueError:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-22f0396ab8ff> in <module>
----> 1 h5.swmr_mode = False

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/opt/miniconda3/envs/audio_tester/lib/python3.7/site-packages/h5py/_hl/files.py in swmr_mode(self, value)
    312                 self._swmr_mode = True
    313             else:
--> 314                 raise ValueError("It is not possible to forcibly switch SWMR mode off.")
    315 
    316     def __init__(self, name, mode=None, driver=None,

ValueError: It is not possible to forcibly switch SWMR mode off.

If this is intended, it would be nice to clarify this in the documentation.
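
For reference, the usual SWMR writer pattern is to create every dataset as resizable before enabling SWMR, and afterwards only to grow the existing datasets and flush. The following is a minimal sketch of that pattern, not an official recipe; the function name `write_with_swmr` and the chunk size are illustrative assumptions.

```python
import h5py
import numpy as np


def write_with_swmr(path):
    """Create the dataset first, then enable SWMR and append to it."""
    with h5py.File(path, "w", libver="latest") as f:
        # The dataset must exist before SWMR is enabled; maxshape=(None,)
        # keeps it resizable so it can still grow afterwards.
        # chunks=(1024,) is an arbitrary choice for this sketch.
        dset = f.create_dataset(
            "np", shape=(0,), maxshape=(None,), dtype="f8", chunks=(1024,)
        )
        f.swmr_mode = True  # cannot be set back to False on this handle

        new_values = np.array([0.4, -0.1, -0.5, 8.0])
        dset.resize((dset.shape[0] + len(new_values),))
        dset[-len(new_values):] = new_values
        dset.flush()  # make the appended data visible to SWMR readers
        return dset.shape[0]
```

Appending to an existing dataset works this way; what is missing is a way to add a new dataset while the file stays in SWMR mode.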

I know you don’t manage HDF5 itself, but could you help me with my thought process?

If with SWMR you’re limited to creating datasets only at file creation time, isn’t it useless for almost any real-world setting? At least in terms of using it as a database. In my case I’m working with Apache Airflow, meaning that independent processes will, at various times, mostly read from but sometimes write to a database. Since I’m dealing with audio data, columnar storage file formats are not the solution.

At first I thought Parallel HDF5 with MPI would be the solution here. However, you need to execute it with mpiexec, meaning you need to coordinate the processes and cannot trigger them from something like Airflow. SWMR also seemed good, because to create new datasets you would only need to turn off SWMR for short moments (a small risk of blocking readers is acceptable in my case). Then data can be added without interrupting readers, and without coordination. But this doesn’t seem to be how SWMR works.

Is it possible from h5py’s side to implement a forceful method to set the file back to h5.swmr_mode = False? Even if this would block readers, it would make SWMR much more useful.
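
One way to get a non-SWMR handle today is to close the file and reopen it, rather than flipping the flag on an open handle. The sketch below assumes that no other process has the file open while it is reopened for writing; the function names and the dataset name "audio" are hypothetical.

```python
import h5py
import numpy as np


def run_swmr_session(path):
    """First session: create the file, enable SWMR, then close it."""
    with h5py.File(path, "w", libver="latest") as f:
        f.create_dataset(
            "np", data=np.array([0.4, -0.1, -0.5, 8.0]), maxshape=(None,)
        )
        f.swmr_mode = True
        # ... append and flush here while readers follow along ...
    # Leaving the with-block closes the file, which ends the SWMR session.


def add_dataset_after_close(path, name, data):
    """Reopen without SWMR, create a new dataset, then enable SWMR again."""
    # Assumption: no reader or writer has the file open at this point.
    with h5py.File(path, "r+", libver="latest") as f:
        was_swmr = f.swmr_mode  # a freshly opened "r+" handle is not in SWMR mode
        f.create_dataset(name, data=data, maxshape=(None,))
        f.swmr_mode = True
        return was_swmr, sorted(f.keys())
```

For example, `add_dataset_after_close("swmr_test.h5", "audio", np.zeros(8))` would add the new dataset between two SWMR sessions. Readers would have to close and reopen the file as well, so this does not give the uninterrupted access asked for above.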

Issue Analytics

Created 4 years ago
Comments: 5 (4 by maintainers)

Top GitHub Comments

NumesSanguis commented, Dec 5, 2019

Thank you for the detailed response. I have a better understanding of how access to HDF5 files in a multi-process setting works.

Corrupted files are indeed not something that should be risked. I’m not familiar with the exact implementation, but what I imagined was that with h5.swmr_mode = False, only subsequent read attempts would fail. Preferably through some mechanism in SWMR that checks at every read attempt whether the file is still in SWMR mode?

Since it is only readers that will be blocked when set to False, those don’t have the power to corrupt anything, right? Although I can imagine that the data being pulled might be incorrect… which is also not desirable.

takluyver commented, Dec 8, 2019

It doesn’t sound like HDF5 is going to work well for what you’re trying to do. SWMR doesn’t really make it easy to have multiple writers, even if only one process needs to write at a time - you’d need some external mechanism to coordinate which process can be the writer, and some clunky closing/reopening of files when the writer changes. SWMR is really meant for scenarios where you have one fixed data producer and one or more plain consumers.
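
As a rough illustration of what such an external mechanism could look like, here is a minimal lock-file sketch using only the Python standard library. It is an assumption for illustration, not something h5py or HDF5 provides: the name `writer_lock` and its parameters are hypothetical, and a writer that crashes would leave a stale lock file behind that has to be cleaned up by other means.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def writer_lock(lock_path, timeout=10.0, poll=0.1):
    """Hold an exclusive lock file while this process acts as the writer."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists,
            # i.e. if another process currently holds the writer role.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        # Release the writer role so the next process can take over.
        os.close(fd)
        os.remove(lock_path)
```

A process would open the HDF5 file for writing only inside `with writer_lock("swmr_test.h5.lock"):` and close it again before leaving the block.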

I’m going to close this as I don’t think there’s anything to fix in h5py: you’re just hitting the limitations of the underlying HDF5 feature.