Multiprocess writes using `to_hdf5`
After seeing @stuartarchibald’s post on gitter, I was a bit curious how `to_hdf5` actually worked and whether it would be viable to call from multiple processes. It turns out `to_hdf5` just calls `store` under the hood while holding the file open locally (roughly as paraphrased below). The biggest issue, of course, is that the file is held open in write mode on the scheduler, meaning nothing else can write to the HDF5 file without corrupting it. Since the type of locking is not specified and `store` uses a `threading.Lock` by default, this is also incompatible with the multiprocessing use case (a `threading.Lock` cannot be serialized or shared across processes). In other words, the current implementation of `to_hdf5` is not friendly even for locked parallel writes.
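For context, here is a hedged paraphrase of what `to_hdf5` roughly does today (simplified; the real implementation lives in dask, and the helper name here is mine):

```python
import h5py
import dask.array as da

def to_hdf5_sketch(filename, datapath, x):
    """Rough paraphrase of dask.array.to_hdf5, not the actual source."""
    with h5py.File(filename, mode="a") as f:  # file stays open locally
        dset = f.require_dataset(
            datapath,
            shape=x.shape,
            dtype=x.dtype,
            chunks=tuple(c[0] for c in x.chunks),
        )
        # store() defaults to lock=True, i.e. a threading.Lock, which
        # cannot be pickled and shipped to worker processes.
        da.store(x, dset)
```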
That said, it should be feasible to change `to_hdf5`’s behavior to be friendlier for storage from multiple processes (though it will still need to be locked). In this case, it would still create the datasets initially (as it already does), but would then close the file. Instead of passing raw HDF5 Datasets as targets, a wrapper class would be needed (to allow for pickling). The wrapper class would provide a `__setitem__` method that opens the HDF5 file and writes to the HDF5 Dataset at the specified selection in a process-safe manner (probably with `locket.lock_file`). Ideally it would provide a `__getitem__` method as well; a sketch follows below. Doing this should allow the HDF5 file to be written to in parallel, assuming the filesystem is robust about syncing changes between different nodes.
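A minimal sketch of such a wrapper (the class name and lock-file path are illustrative assumptions, not an existing dask API):

```python
import h5py
import locket

class HDF5Target:
    """Hypothetical picklable stand-in for an h5py.Dataset."""

    def __init__(self, filename, datapath, lockfile=None):
        self.filename = filename
        self.datapath = datapath
        self.lockfile = lockfile or filename + ".lock"

    def __setitem__(self, key, value):
        # Serialize writers across processes with a file lock,
        # opening the file only for the duration of the write.
        with locket.lock_file(self.lockfile):
            with h5py.File(self.filename, mode="r+") as f:
                f[self.datapath][key] = value

    def __getitem__(self, key):
        with locket.lock_file(self.lockfile):
            with h5py.File(self.filename, mode="r") as f:
                return f[self.datapath][key]
```

Since instances hold only strings, they pickle cleanly and could be handed to `store` as targets from any process.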
An alternative proposal that would avoid locking entirely would be to serialize the data back to the scheduler, which then writes each piece to the HDF5 file as it arrives (sketched below). This would avoid the overhead of the previous strategy (and any potential locking issues) by guaranteeing that only one process ever opens the HDF5 file, keeping it open until everything is written. This strategy would continue to work well for non-parallel use cases, with arguably less overhead than is present now. The only thing to do before writing out all the results would be to optimize the graphs, since `store` would no longer be used.
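A minimal sketch of that single-writer pattern on top of `dask.distributed` (the block/slice bookkeeping via `slices_from_chunks` is my own glue, not something `to_hdf5` does today):

```python
import h5py
import dask.array as da
from dask.array.core import slices_from_chunks
from dask.distributed import Client, as_completed

client = Client()
x = da.random.random((1000, 1000), chunks=(250, 250))

# Pair each block with the region of the dataset it belongs to.
futures = client.compute(list(x.to_delayed().ravel()))
regions = dict(zip(futures, slices_from_chunks(x.chunks)))

# Only this process ever opens the file; pieces are written as they arrive.
with h5py.File("out.h5", mode="w") as f:
    dset = f.create_dataset("x", shape=x.shape, dtype=x.dtype)
    for future in as_completed(futures):
        dset[regions[future]] = future.result()
```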
Top GitHub Comments
It’s worth noting that Zarr has support for copying data into HDF5 files.
ref: http://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.copy
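For example, under the Zarr 2.x API documented there (names below are illustrative), an in-memory Zarr array can be copied straight into an open `h5py.File`:

```python
import h5py
import zarr

z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f8")
z[:] = 42.0

# zarr.copy accepts an h5py Group/File as the destination.
with h5py.File("out.h5", mode="w") as f:
    zarr.copy(z, f, name="x")
```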
From my perspective, streaming computed arrays to a single writer when using `distributed` would be a nice compromise here. I’m not sure what machinery would be needed to make that happen, but I can vouch for its potential application.

In terms of virtual datasets, I’m headed this way with xarray as well. In our case we can use `save_mfdataset(datasets, paths)` to enable parallel writing to separate files, as sketched below. `datasets` would be some partitioned version of a single `xarray.Dataset` made up of dask arrays.
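A minimal sketch of that pattern (the dimension name, partition size, and paths are made up):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"x": ("t", np.arange(100))}).chunk({"t": 25})

# Partition the dataset along one dimension, one file per partition.
datasets = [ds.isel(t=slice(i, i + 25)) for i in range(0, 100, 25)]
paths = [f"part-{i}.nc" for i in range(len(datasets))]
xr.save_mfdataset(datasets, paths)
```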