SWMR cannot be turned off once set to True; Allow forceful switching off
- Operating System: import h5py; print(h5py.version.info)
- Python version: 3.7.5 (default, Oct 25 2019, 15:51:11)
- Where Python was acquired: Miniconda3
- h5py version: 2.10.0
- HDF5 version: 1.10.5
- numpy: 1.17.3
When using SWMR, are you limited to creating datasets only at file creation time? Or are you supposed to be able to turn off SWMR when the file is not being accessed by any other process?
Turning off SWMR is not possible (Jupyter Notebook, kernel restarted):
import h5py
import numpy as np
arr = np.array([.4, -.1, -.5, 8])
h5 = h5py.File("swmr_test.h5", 'w', libver='latest')
h5["np"] = arr
h5.swmr_mode = True
h5.swmr_mode = False  # raises the ValueError shown in the traceback
ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-22f0396ab8ff> in <module>
----> 1 h5.swmr_mode = False
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
/opt/miniconda3/envs/audio_tester/lib/python3.7/site-packages/h5py/_hl/files.py in swmr_mode(self, value)
312 self._swmr_mode = True
313 else:
--> 314 raise ValueError("It is not possible to forcibly switch SWMR mode off.")
315
316 def __init__(self, name, mode=None, driver=None,
ValueError: It is not possible to forcibly switch SWMR mode off.
If this is intended, it would be nice to clarify this in the documentation.
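For reference, here is a minimal sketch of the order that SWMR does support, as far as I understand the h5py documentation: create every dataset first, then enable SWMR mode, and from then on only append to existing resizable datasets. The dataset name `audio`, the chunk size, and the temporary file path are illustrative assumptions, not part of the report above.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "swmr_append.h5")

with h5py.File(path, "w", libver="latest") as f:
    # Create all datasets up front; maxshape=(None,) makes this one resizable.
    dset = f.create_dataset(
        "audio", shape=(0,), maxshape=(None,), chunks=(1024,), dtype="f8"
    )

    # One-way switch: after this, new groups and datasets can no longer be added.
    f.swmr_mode = True

    # Appending to an existing dataset is still allowed in SWMR mode.
    for block in (np.array([0.4, -0.1]), np.array([-0.5, 8.0])):
        start = dset.shape[0]
        dset.resize((start + len(block),))
        dset[start:] = block
        dset.flush()  # make the new samples visible to SWMR readers

    final_length = dset.shape[0]
```

So the limitation seems to be about creating new objects, not about writing data: an existing dataset can keep growing while readers have the file open.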
I know you don’t manage HDF5 itself, but could you help me with my thought process?
If with SWMR you’re limited to creating datasets only at file creation time, isn’t it useless for almost any real environment? At least in terms of using it as a database. In my case I’m working with Apache Airflow, meaning that independent processes will mostly read from, but sometimes write to, a database at various times. Since I’m dealing with audio data, columnar storage file formats are not the solution.
At first I thought Parallel HDF5 with MPI would be the solution here. However, you need to execute it with mpiexec, meaning you need to coordinate the processes and you cannot trigger it from something like Airflow.
SWMR also seemed good, because I assumed that to create new datasets you would only need to turn off SWMR for short moments (a small risk of blocking readers is acceptable in my case). Data could then be added without interrupting readers and without coordination. But this doesn’t seem to be how SWMR works.
Is it possible from h5py’s side to implement a forceful method to set the file back to h5.swmr_mode = False? Even if this would block readers, it would make SWMR much more useful.
Related issue: https://github.com/h5py/h5py/issues/712
Thank you for the detailed response. I have a better understanding of how access to HDF5 files in a multi-process setting works.
Corrupted files are indeed not worth risking. I’m not familiar with the exact implementation, but what I imagined was that with h5.swmr_mode = False, only subsequent read attempts would fail. Perhaps SWMR could have some mechanism that checks at every read attempt whether the file is still in SWMR mode? Since only readers would be blocked when it is set to False, they don’t have the power to corrupt anything, right? Although I can imagine that the data being read might be incorrect, which is also not desirable.
It doesn’t sound like HDF5 is going to work well for what you’re trying to do. SWMR doesn’t really make it easy to have multiple writers, even if only one process needs to write at a time - you’d need some external mechanism to coordinate which process can be the writer, and some clunky closing/reopening of files when the writer changes. SWMR is really meant for scenarios where you have one fixed data producer and one or more plain consumers.
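To illustrate the closing/reopening mentioned above, here is a rough sketch rather than a recommended pattern. It assumes that some external mechanism (not shown) guarantees that no other process has the file open while the writer reopens it. The helper name `add_dataset_between_sessions`, the dataset name `more`, and the temporary path are made up for this example.

```python
import os
import tempfile

import h5py
import numpy as np


def add_dataset_between_sessions(path, name, data):
    """Reopen the file without SWMR, add one dataset, then re-enter SWMR mode.

    Assumption: no reader or other writer has the file open while this runs.
    """
    f = h5py.File(path, "a", libver="latest")  # plain read/write, SWMR off
    f.create_dataset(name, data=data, maxshape=(None,))
    f.swmr_mode = True  # one-way switch for this file handle
    return f


# Hypothetical scratch location, used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "swmr_test.h5")

# First writer session, mirroring the snippet from the issue.
f = h5py.File(path, "w", libver="latest")
f["np"] = np.array([0.4, -0.1, -0.5, 8.0])
f.swmr_mode = True
f.close()  # closing the file is the only way to leave SWMR mode

# Second writer session: the new dataset is created before SWMR is re-enabled.
f = add_dataset_between_sessions(path, "more", np.zeros(3))
names = sorted(f.keys())
swmr_on = bool(f.swmr_mode)
f.close()
```

Readers would have to close and reopen the file around each such cycle as well, which is where the external coordination comes in.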
I’m going to close this as I don’t think there’s anything to fix in h5py: you’re just hitting the limitations of the underlying HDF5 feature.