Multiprocess writes using `to_hdf5`
After seeing @stuartarchibald’s post on gitter, I was a bit curious how `to_hdf5` actually worked and whether it would be viable to call from multiple processes. It turns out `to_hdf5` just calls `store` under the hood while holding the file open locally (roughly as paraphrased below). The biggest issue, of course, is that the file is held open in write mode on the scheduler, meaning nothing else can write to the HDF5 file without corrupting it. Since the type of locking is not specified and `store` uses a `threading.Lock` by default, this is also incompatible with the multiprocessing use case (a `threading.Lock` cannot be serialized or shared across processes). In other words, the current implementation of `to_hdf5` is not friendly even for locked parallel writes.
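For context, here is a hedged paraphrase of what `to_hdf5` roughly does today (simplified; the real implementation lives in dask, and the helper name here is mine):

```python
import h5py
import dask.array as da

def to_hdf5_sketch(filename, datapath, x):
    """Rough paraphrase of dask.array.to_hdf5, not the actual source."""
    with h5py.File(filename, mode="a") as f:  # file stays open locally
        dset = f.require_dataset(
            datapath,
            shape=x.shape,
            dtype=x.dtype,
            chunks=tuple(c[0] for c in x.chunks),
        )
        # store() defaults to lock=True, i.e. a threading.Lock, which
        # cannot be pickled and shipped to worker processes.
        da.store(x, dset)
```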
That said, it should be feasible to change `to_hdf5`’s behavior to be friendlier for storage from multiple processes (though it will still need to be locked). In this case, it would still create the datasets initially (as it already does), but would then close the file. Instead of passing raw HDF5 Datasets as targets, a wrapper class would be needed (to allow for pickling). The wrapper class would provide a `__setitem__` method that opens the HDF5 file and writes to the HDF5 Dataset at the specified selection in a process-safe manner (probably with `locket.lock_file`). Ideally it would provide a `__getitem__` method as well; a sketch follows below. Doing this should allow the HDF5 file to be written to in parallel, assuming the filesystem is robust about syncing changes between different nodes.
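A minimal sketch of such a wrapper (the class name and lock-file path are illustrative assumptions, not an existing dask API):

```python
import h5py
import locket

class HDF5Target:
    """Hypothetical picklable stand-in for an h5py.Dataset."""

    def __init__(self, filename, datapath, lockfile=None):
        self.filename = filename
        self.datapath = datapath
        self.lockfile = lockfile or filename + ".lock"

    def __setitem__(self, key, value):
        # Serialize writers across processes with a file lock,
        # opening the file only for the duration of the write.
        with locket.lock_file(self.lockfile):
            with h5py.File(self.filename, mode="r+") as f:
                f[self.datapath][key] = value

    def __getitem__(self, key):
        with locket.lock_file(self.lockfile):
            with h5py.File(self.filename, mode="r") as f:
                return f[self.datapath][key]
```

Since instances hold only strings, they pickle cleanly and could be handed to `store` as targets from any process.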
An alternative proposal that would avoid locking entirely would be to serialize the data back to the scheduler, which then writes each piece to the HDF5 file as it arrives (sketched below). This would avoid the overhead of the previous strategy (and any potential locking issues) by guaranteeing that only one process ever opens the HDF5 file, keeping it open until everything is written. This strategy would continue to work well for non-parallel use cases, with arguably less overhead than is present now. The only thing to do before writing out all the results would be to optimize the graphs, since `store` would no longer be used.
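A minimal sketch of that single-writer pattern on top of `dask.distributed` (the block/slice bookkeeping via `slices_from_chunks` is my own glue, not something `to_hdf5` does today):

```python
import h5py
import dask.array as da
from dask.array.core import slices_from_chunks
from dask.distributed import Client, as_completed

client = Client()
x = da.random.random((1000, 1000), chunks=(250, 250))

# Pair each block with the region of the dataset it belongs to.
futures = client.compute(list(x.to_delayed().ravel()))
regions = dict(zip(futures, slices_from_chunks(x.chunks)))

# Only this process ever opens the file; pieces are written as they arrive.
with h5py.File("out.h5", mode="w") as f:
    dset = f.create_dataset("x", shape=x.shape, dtype=x.dtype)
    for future in as_completed(futures):
        dset[regions[future]] = future.result()
```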
Top GitHub Comments
It’s worth noting that Zarr has support for copying data into HDF5 files.
ref: http://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.copy
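For example, under the Zarr 2.x API documented there (names below are illustrative), an in-memory Zarr array can be copied straight into an open `h5py.File`:

```python
import h5py
import zarr

z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype="f8")
z[:] = 42.0

# zarr.copy accepts an h5py Group/File as the destination.
with h5py.File("out.h5", mode="w") as f:
    zarr.copy(z, f, name="x")
```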
From my perspective, streaming computed arrays to a single writer when using `distributed` would be a nice compromise here. I’m not sure what machinery would be needed to make that happen, but I can vouch for its potential application.

In terms of virtual datasets, I’m headed this way with xarray as well. In our case we can use `save_mfdataset(datasets, paths)` to enable parallel writing to separate files, as sketched below. `datasets` would be some partitioned version of a single `xarray.Dataset` made up of dask arrays.
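A minimal sketch of that pattern (the dimension name, partition size, and paths are made up):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"x": ("t", np.arange(100))}).chunk({"t": 25})

# Partition the dataset along one dimension, one file per partition.
datasets = [ds.isel(t=slice(i, i + 25)) for i in range(0, 100, 25)]
paths = [f"part-{i}.nc" for i in range(len(datasets))]
xr.save_mfdataset(datasets, paths)
```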