
High memory usage for file to file streaming - h5py and zarr


What happened:

Streaming dask arrays from disk back to disk appears to cause the entire array to be loaded into RAM. The behaviour is the same for HDF5 files (via h5py) and for zarr stores.

The test below creates a large random array, writes it to disk, then uses dask to reload the array and save it to another file. I observed the same problem when saving into the same hdf5 file (test2a).

The correct behaviour appears to be present in dask 2.28, where RAM usage is far lower and is capped at a moderate number of chunks.

What you expected to happen:

RAM usage ought to be constrained to a small number of chunks, which is the behaviour seen in dask 2.28.
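
For scale (computed from the array parameters used in the example below; the snippet is not part of the original report), each (128, 128, 128) float64 chunk is 16 MiB while the full (128, 5120, 5120) array is 25 GiB, so bounded streaming and full materialisation are easy to tell apart in a memory trace:

import numpy as np

itemsize = np.dtype('float64').itemsize                  # 8 bytes per element
chunk_bytes = 128 * 128 * 128 * itemsize                 # one chunk
total_bytes = 128 * (512 * 10) * (512 * 10) * itemsize   # whole array
print(chunk_bytes / 2**20, 'MiB per chunk')              # 16.0 MiB
print(total_bytes / 2**30, 'GiB for the full array')     # 25.0 GiB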

Minimal Complete Verifiable Example:

Single-threaded example


import h5py
import dask
import dask.array as da

from concurrent.futures import ThreadPoolExecutor

# force a single worker thread so memory growth cannot be explained by parallelism
pool = ThreadPoolExecutor(max_workers=1)
dc = dask.config.set(pool=pool, scheduler='threads')


def test2():
    # reading from one hdf5 file into another
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf
    x.to_hdf5(fname, '/data', chunks=True)
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r+')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True)
    print("Finished stage 2")


def test2a():
    # reading from one hdf5 file into another
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf
    x.to_hdf5(fname, '/data', chunks=True)
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True)
    print("Finished stage 2")


def test2b():
    # reading from one zarr file into another
    fname = '/tmp/random_test2b.zarr'
    fname2 = '/tmp/random_test2b2.zarr'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to zarr
    x.to_zarr(fname, component='data')
    print("Finished stage 1")
    daskimg = da.from_zarr(fname, component='/data')
    # Now copy to another zarr file
    print(daskimg)
    daskimg.to_zarr(fname2, component='/data')
    print("Finished stage 2")
            
    
test2()
test2a()
test2b()
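
For comparison, the same disk-to-disk copy can be written as an explicit da.store into a pre-created h5py dataset, which makes the intended chunk-by-chunk write path explicit. This is an illustrative sketch only (copy_hdf5_via_store, the destination file name, and the chunk sizes are taken or adapted from the example above; it is not part of the original report):

import h5py
import dask.array as da

def copy_hdf5_via_store(src='/tmp/random_test2.hdf5', dst='/tmp/random_test2c.hdf5'):
    # stream /data from src to dst one chunk at a time via an explicit store call
    with h5py.File(src, 'r') as fin, h5py.File(dst, 'w') as fout:
        daskimg = da.from_array(fin['/data'], chunks=(128, 128, 128))
        out = fout.create_dataset('/data', shape=daskimg.shape,
                                  dtype=daskimg.dtype, chunks=(128, 128, 128))
        da.store(daskimg, out, lock=True)  # lock=True serialises the h5py writes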

Multi-threaded example for dask 2.28


import h5py
import dask
import dask.array as da

def test2():
    # reading from one hdf5 file into another
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf
    x.to_hdf5(fname, '/data', chunks=True, compression="gzip")
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r+')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True, compression="gzip")
    print("Finished stage 2")


def test2a():
    # reading from one hdf5 file into another
    fname = '/tmp/random_test2.hdf5'
    fname2 = '/tmp/random_test2a.hdf5'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to hdf
    x.to_hdf5(fname, '/data', chunks=True, compression="gzip")
    print("Finished stage 1")
    h5 = h5py.File(fname, 'r')
    daskimg = da.from_array(h5['/data'])
    # Now copy to another hdf5 file
    print(daskimg)
    daskimg.to_hdf5(fname2, '/data', chunks=True, compression="gzip")
    print("Finished stage 2")


def test2b():
    # reading from one zarr file into another
    fname = '/tmp/random_test2b.zarr'
    fname2 = '/tmp/random_test2b2.zarr'
    x = da.random.random((128, 512*10, 512*10), chunks=(128, 128, 128))
    # write to zarr
    x.to_zarr(fname, component='data')
    print("Finished stage 1")
    daskimg = da.from_zarr(fname, component='/data')
    # Now copy to another zarr file
    print(daskimg)
    daskimg.to_zarr(fname2, component='/data')
    print("Finished stage 2")
            
    
test2()
test2a()
test2b()

Anything else we need to know?:

I was not able to set the thread count to 1 for dask 2.28.
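
For reference (not shown in the original report), one common way to force single-threaded execution in recent dask versions is the synchronous scheduler, which runs every task in the calling thread:

import dask

with dask.config.set(scheduler='synchronous'):
    daskimg.to_hdf5(fname2, '/data', chunks=True)  # daskimg and fname2 as in test2() above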

I used the following to test RAM usage:


top -d 1 -b | grep "^processid" > log
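
An alternative (not used in the original report) is to sample the resident set size from inside the script with psutil, which avoids having to match the process id in top's output; psutil is assumed to be installed:

import os
import threading
import time
import psutil

def log_rss(path='/tmp/rss_log.txt', interval=1.0):
    # append this process's resident set size (bytes) once per interval
    proc = psutil.Process(os.getpid())
    with open(path, 'w') as fh:
        while True:
            fh.write(f'{time.time():.1f}\t{proc.memory_info().rss}\n')
            fh.flush()
            time.sleep(interval)

threading.Thread(target=log_rss, daemon=True).start()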

Environment:

  • Dask version: 2021.7.2+21.g1c4a8422 and 2.28
  • Python version: 3.8.1
  • Operating System: Ubuntu 18.04
  • Install method (conda, pip, source): 2.28 via pip into a conda environment; 2021.7.2+21.g1c4a8422 from source

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

2 reactions
richardbeare commented, Aug 18, 2021

@jrbourbeau Confirmed - #8040 behaves in the expected way regarding RAM. Great news!

Now if only h5py was as fast as zarr!

1 reaction
richardbeare commented, Aug 18, 2021

Thanks everyone - I think that covers it! I can keep my current project moving forward now that I have these options.
