Memory issue for rechunk
Hi,
I have a large NumPy array (1000x1000x2000). When I try to rechunk it into pieces with dask.distributed, memory keeps increasing and eventually raises a MemoryError. My machine has 128 GB of RAM while the array is only about 8 GB, so I am not clear what is going on here. My code is as follows:
import numpy as np
import dask.array as da
from dask.distributed import Client

client = Client()
data = np.random.rand(1000, 1000, 2000).astype(np.float32)
future = client.scatter(data)
daskdata = da.from_delayed(future, shape=data.shape, dtype=np.float32)
# rechunk takes the new chunk shape as a single tuple
daskdata2 = daskdata.rechunk((300, 300, 900)).persist()
The code above works perfectly fine for a smaller array, e.g. 1000x1000x1000, so I don’t understand what the reason is here.
Issue Analytics
- Created 6 years ago
- Comments: 5 (4 by maintainers)
Top Results From Across the Web
Dask - Rechunk or array slicing causing large memory usage?
I was looking for some help with understanding some excessive (or possibly not) memory usage in my Dask processing chain. The problem comes …

Rechunker: The missing link for chunked array analytics
One existing solution is to use Dask's rechunk function to create a new chunk structure lazily, on the fly, in memory. This works …

dask.array.rechunk - Dask documentation
The rechunk module defines intersect_chunks, a function for converting chunks to a new chunk structure …

Increase or decrease the number of chunks in the disk.frame
rechunk: Increase or decrease the number of chunks in the disk.frame. In disk.frame: Larger-than-RAM Disk-Based Data Manipulation Framework …

Parallel computing with dask — xarray 0.10.2 documentation
At that point, data is loaded into memory and computation proceeds in a … To fix, rechunk into a single dask array chunk …
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I suspect that your single large numpy array is larger than the intended memory for any of your individual processes. You might try one of the following options:
- Client(processes=False)
- Split locally first
- Use a cluster with a different process/thread breakdown
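The first and third options could be sketched as follows (the worker and thread counts below are assumptions for illustration, not values from the issue):

```python
from dask.distributed import Client, LocalCluster

# Option 1: a threads-only client. All workers live in one process and
# share one memory space, so the scattered 8 GB array is never copied
# between processes.
client = Client(processes=False)
client.close()

# Option 3: fewer processes, each with more threads (counts assumed here).
# Each process then gets a larger slice of the machine's RAM.
cluster = LocalCluster(n_workers=2, threads_per_worker=4)
client = Client(cluster)
client.close()
cluster.close()
```

Which breakdown is better depends on the workload: threads share memory but contend on the GIL for pure-Python code, while processes isolate memory per worker.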
Generally speaking, Dask expects any particular value to fit easily in the memory of one of your processes. If you use the defaults, it creates as many processes as you have logical cores, which divides your memory space among them.
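A quick back-of-the-envelope check of that arithmetic (the worker count below is an assumption for a typical many-core machine, not a value from the issue):

```python
# float32 = 4 bytes per element
array_bytes = 1000 * 1000 * 2000 * 4
print(array_bytes / 2**30)   # ~7.45 GiB for the full array

# If the default Client() starts, say, 16 worker processes on a 128 GB
# machine (an assumption), each process is budgeted roughly:
per_process_gb = 128 / 16    # 8 GB per process
```

A single ~7.45 GiB value leaves almost no headroom in an 8 GB per-process budget once rechunking starts making copies of the data.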
Typically people don’t handle very large NumPy arrays directly; they break them up somehow. Yes, HDF5 or NetCDF would work fine. You might also create your data in chunks. The right way to do things often depends on your application.
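For the random test data in the question, "create your data in chunks" could look like this (a sketch that only applies when the data really can be generated chunk by chunk, as random data can):

```python
import dask.array as da

# Build the array as a chunked dask collection from the start: each
# 300x300x900 float32 chunk is ~309 MiB, and no single task ever
# materializes the full 8 GB array, so no scatter step is needed.
data = da.random.random((1000, 1000, 2000), chunks=(300, 300, 900)).astype("float32")
print(data.npartitions)  # 4 * 4 * 3 = 48 chunks
```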