High memory usage for file-to-file streaming - h5py and zarr
What happened:
Streaming dask arrays from disk back to disk appears to load the entire array into RAM. The same behaviour occurs for both HDF5 files (via h5py) and zarr.
The test below creates a large random array, writes it to disk, then uses dask to reload the array and save it in another file. I observed the same problems when saving into the same hdf5 file (test2a).
The correct behaviour appears to be present in dask 2.28, where RAM usage is far lower and capped at a moderate number of chunks.
What you expected to happen:
RAM usage ought to be constrained to a small number of chunks - the behaviour seen in dask 2.28.
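For a sense of that lower bound, a plain h5py copy that touches one chunk at a time holds only a single chunk in RAM. A minimal sketch (the destination path is illustrative, and the loop assumes the source dataset is chunked):

import h5py

# Minimal sketch: chunk-by-chunk copy with plain h5py, holding one chunk
# (128**3 doubles, ~16 MiB) in RAM at a time. Destination path is illustrative.
with h5py.File('/tmp/random_test2.hdf5', 'r') as src, \
     h5py.File('/tmp/random_test2_baseline.hdf5', 'w') as dst:
    s = src['/data']
    d = dst.create_dataset('/data', shape=s.shape, dtype=s.dtype, chunks=s.chunks)
    cz, cy, cx = s.chunks
    for i in range(0, s.shape[0], cz):
        for j in range(0, s.shape[1], cy):
            for k in range(0, s.shape[2], cx):
                d[i:i+cz, j:j+cy, k:k+cx] = s[i:i+cz, j:j+cy, k:k+cx]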
Minimal Complete Verifiable Example:
Single-threaded example
import h5py
import dask
import dask.array as da
from concurrent.futures import ThreadPoolExecutor

# Force a single worker thread so chunk scheduling is easy to reason about.
pool = ThreadPoolExecutor(max_workers=1)
dc = dask.config.set(pool=pool, scheduler='threads')

def test2():
    # reading from one hdf5 file into another ('r+' handle on the source)
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf5
    x.to_hdf5(fname, '/data', chunks=True)
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r+')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True)
    print("Finished stage 2")

def test2a():
    # same as test2, but the source file is opened read-only
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf5
    x.to_hdf5(fname, '/data', chunks=True)
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True)
    print("Finished stage 2")

def test2b():
    # reading from one zarr file into another
    fname = '/tmp/random_test2b.zarr'
    fname2 = '/tmp/random_test2b2.zarr'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to zarr
    x.to_zarr(fname, component='data')
    print("Finished stage 1")
    daskimg = da.from_zarr(fname, component='/data')
    # Now copy to another zarr file
    print(daskimg)
    daskimg.to_zarr(fname2, component='/data')
    print("Finished stage 2")

test2()
test2a()
test2b()
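For what it's worth, the stage-2 copy in these tests can also be spelled with da.store, which is roughly what to_hdf5 does internally; a minimal sketch (the destination path is illustrative):

import h5py
import dask.array as da

# Minimal sketch of stage 2 via da.store; the destination path is
# illustrative. lock=True serialises h5py access, which is not thread-safe.
with h5py.File('/tmp/random_test2.hdf5', 'r') as src, \
     h5py.File('/tmp/random_test2_store.hdf5', 'w') as dst:
    x = da.from_array(src['/data'], chunks=(128, 128, 128))
    d = dst.create_dataset('/data', shape=x.shape, dtype=x.dtype,
                           chunks=(128, 128, 128))
    da.store(x, d, lock=True)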
Multi-threaded example for dask 2.28
import h5py
import dask
import dask.array as da

def test2():
    # reading from one hdf5 file into another ('r+' handle on the source)
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf5
    x.to_hdf5(fname, '/data', chunks=True, compression="gzip")
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r+')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True, compression="gzip")
    print("Finished stage 2")

def test2a():
    # same as test2, but the source file is opened read-only
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf5
    x.to_hdf5(fname, '/data', chunks=True, compression="gzip")
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True, compression="gzip")
    print("Finished stage 2")

def test2b():
    # reading from one zarr file into another
    fname = '/tmp/random_test2b.zarr'
    fname2 = '/tmp/random_test2b2.zarr'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to zarr
    x.to_zarr(fname, component='data')
    print("Finished stage 1")
    daskimg = da.from_zarr(fname, component='/data')
    # Now copy to another zarr file
    print(daskimg)
    daskimg.to_zarr(fname2, component='/data')
    print("Finished stage 2")

test2()
test2a()
test2b()
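As an aside, when source and destination are both zarr with identical chunking, the copy can also be done at the store level, moving compressed chunks one key at a time without involving dask. A minimal sketch, assuming zarr-python 2.x (destination path illustrative):

import zarr

# Minimal sketch, assuming zarr-python 2.x: copy raw compressed chunks
# between directory stores one key at a time; destination path illustrative.
src = zarr.DirectoryStore('/tmp/random_test2b.zarr')
dst = zarr.DirectoryStore('/tmp/random_test2b_copy.zarr')
zarr.copy_store(src, dst)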
Anything else we need to know?:
I was not able to set the thread count to 1 for dask 2.28.
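(For reference, dask's built-in synchronous scheduler is another way to force serial execution and should work on older releases too; a minimal sketch:)

import dask
import dask.array as da

# Minimal sketch: the synchronous scheduler runs every task in the calling
# thread, so no thread pool is involved.
with dask.config.set(scheduler='synchronous'):
    x = da.random.random((1024, 1024), chunks=(256, 256))
    print(x.sum().compute())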
I used the following to test RAM usage:
top -d 1 -b | grep "^processid" > log
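An equivalent in-process probe can be written with psutil (an assumed extra dependency); a minimal sketch mirroring the one-second interval of top -d 1:

import time
import psutil

# Minimal sketch of an RSS logger, mirroring the top/grep pipeline above.
# psutil is an assumed extra dependency; pass a pid to watch another process.
proc = psutil.Process()
for _ in range(600):  # one sample per second for ten minutes
    print(f"{proc.memory_info().rss / 2**20:.1f} MiB", flush=True)
    time.sleep(1)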
Environment:
- Dask version: 2021.7.2+21.g1c4a8422 and 2.28
- Python version: 3.8.1
- Operating System: Ubuntu 18.04
- Install method (conda, pip, source): pip into a conda environment for 2.28; from source for 2021.7.2+21.g1c4a8422
@jrbourbeau Confirmed - #8040 behaves in the expected way regarding RAM. Great news!
Now if only h5py were as fast as zarr!
Thanks everyone - I think that covers it! I can keep my current project moving forward now that I have these options.