High memory usage for file-to-file streaming - h5py and zarr
What happened:
Streaming dask arrays from disk back to disk appears to load the entire array into RAM. The same behaviour occurs for both HDF5 files (via h5py) and zarr.
The test below creates a large random array, writes it to disk, then uses dask to reload the array and save it in another file. I observed the same problems when saving into the same hdf5 file (test2a).
The correct behaviour appears to be present in dask 2.28, where RAM usage is far lower and capped at a moderate number of chunks.
What you expected to happen:
RAM usage ought to be constrained to a small number of chunks - the behaviour seen in dask 2.28.
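For a sense of that lower bound, a plain h5py copy that touches one chunk at a time holds only a single chunk in RAM. A minimal sketch (the destination path is illustrative, and the loop assumes the source dataset is chunked):

import h5py

# Minimal sketch: chunk-by-chunk copy with plain h5py, holding one chunk
# (128**3 doubles, ~16 MiB) in RAM at a time. Destination path is illustrative.
with h5py.File('/tmp/random_test2.hdf5', 'r') as src, \
     h5py.File('/tmp/random_test2_baseline.hdf5', 'w') as dst:
    s = src['/data']
    d = dst.create_dataset('/data', shape=s.shape, dtype=s.dtype, chunks=s.chunks)
    cz, cy, cx = s.chunks
    for i in range(0, s.shape[0], cz):
        for j in range(0, s.shape[1], cy):
            for k in range(0, s.shape[2], cx):
                d[i:i+cz, j:j+cy, k:k+cx] = s[i:i+cz, j:j+cy, k:k+cx]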
Minimal Complete Verifiable Example:
Single-threaded example
import h5py
import dask
import dask.array as da
from concurrent.futures import ThreadPoolExecutor

# Force a single worker thread so chunk scheduling is easy to reason about.
pool = ThreadPoolExecutor(max_workers=1)
dc = dask.config.set(pool=pool, scheduler='threads')

def test2():
    # reading from one hdf5 file into another ('r+' handle on the source)
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf5
    x.to_hdf5(fname, '/data', chunks=True)
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r+')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True)
    print("Finished stage 2")

def test2a():
    # same as test2, but the source file is opened read-only
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf5
    x.to_hdf5(fname, '/data', chunks=True)
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True)
    print("Finished stage 2")

def test2b():
    # reading from one zarr file into another
    fname = '/tmp/random_test2b.zarr'
    fname2 = '/tmp/random_test2b2.zarr'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to zarr
    x.to_zarr(fname, component='data')
    print("Finished stage 1")
    daskimg = da.from_zarr(fname, component='/data')
    # Now copy to another zarr file
    print(daskimg)
    daskimg.to_zarr(fname2, component='/data')
    print("Finished stage 2")

test2()
test2a()
test2b()
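For what it's worth, the stage-2 copy in these tests can also be spelled with da.store, which is roughly what to_hdf5 does internally; a minimal sketch (the destination path is illustrative):

import h5py
import dask.array as da

# Minimal sketch of stage 2 via da.store; the destination path is
# illustrative. lock=True serialises h5py access, which is not thread-safe.
with h5py.File('/tmp/random_test2.hdf5', 'r') as src, \
     h5py.File('/tmp/random_test2_store.hdf5', 'w') as dst:
    x = da.from_array(src['/data'], chunks=(128, 128, 128))
    d = dst.create_dataset('/data', shape=x.shape, dtype=x.dtype,
                           chunks=(128, 128, 128))
    da.store(x, d, lock=True)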
Multi-threaded example for dask 2.28
import h5py
import dask
import dask.array as da

def test2():
    # reading from one hdf5 file into another ('r+' handle on the source)
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf5
    x.to_hdf5(fname, '/data', chunks=True, compression="gzip")
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r+')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True, compression="gzip")
    print("Finished stage 2")

def test2a():
    # same as test2, but the source file is opened read-only
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf5
    x.to_hdf5(fname, '/data', chunks=True, compression="gzip")
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True, compression="gzip")
    print("Finished stage 2")

def test2b():
    # reading from one zarr file into another
    fname = '/tmp/random_test2b.zarr'
    fname2 = '/tmp/random_test2b2.zarr'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to zarr
    x.to_zarr(fname, component='data')
    print("Finished stage 1")
    daskimg = da.from_zarr(fname, component='/data')
    # Now copy to another zarr file
    print(daskimg)
    daskimg.to_zarr(fname2, component='/data')
    print("Finished stage 2")

test2()
test2a()
test2b()
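As an aside, when source and destination are both zarr with identical chunking, the copy can also be done at the store level, moving compressed chunks one key at a time without involving dask. A minimal sketch, assuming zarr-python 2.x (destination path illustrative):

import zarr

# Minimal sketch, assuming zarr-python 2.x: copy raw compressed chunks
# between directory stores one key at a time; destination path illustrative.
src = zarr.DirectoryStore('/tmp/random_test2b.zarr')
dst = zarr.DirectoryStore('/tmp/random_test2b_copy.zarr')
zarr.copy_store(src, dst)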
Anything else we need to know?:
I was not able to set the thread count to 1 for dask 2.28.
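(For reference, dask's built-in synchronous scheduler is another way to force serial execution and should work on older releases too; a minimal sketch:)

import dask
import dask.array as da

# Minimal sketch: the synchronous scheduler runs every task in the calling
# thread, so no thread pool is involved.
with dask.config.set(scheduler='synchronous'):
    x = da.random.random((1024, 1024), chunks=(256, 256))
    print(x.sum().compute())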
I used the following to test RAM usage:
top -d 1 -b | grep "^processid" > log
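An equivalent in-process probe can be written with psutil (an assumed extra dependency); a minimal sketch mirroring the one-second interval of top -d 1:

import time
import psutil

# Minimal sketch of an RSS logger, mirroring the top/grep pipeline above.
# psutil is an assumed extra dependency; pass a pid to watch another process.
proc = psutil.Process()
for _ in range(600):  # one sample per second for ten minutes
    print(f"{proc.memory_info().rss / 2**20:.1f} MiB", flush=True)
    time.sleep(1)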
Environment:
- Dask version: 2021.7.2+21.g1c4a8422 and 2.28
- Python version: 3.8.1
- Operating System: Ubuntu 18.04
- Install method (conda, pip, source): pip into a conda environment for 2.28; from source for 2021.7.2+21.g1c4a8422
@jrbourbeau Confirmed - #8040 behaves in the expected way regarding RAM. Great news!
Now if only h5py were as fast as zarr!
Thanks everyone - I think that covers it! I can keep my current project moving forward now that I have these options.