Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Memory issue for rechunk

See original GitHub issue

Hi,

I have a large numpy array (1000x1000x2000). When I try to rechunk it into pieces in dask.distributed, the memory keeps increasing and eventually raises a MemoryError. My machine has 128 GB of RAM while the array is only about 8 GB, so I am not sure what is going on here. My code is as follows:

import numpy as np
import dask.array as da
from dask.distributed import Client

client = Client()
data = np.random.rand(1000, 1000, 2000).astype(np.float32)  # ~8 GB array
future = client.scatter(data)
daskdata = da.from_delayed(future, shape=data.shape, dtype=np.float32)
daskdata2 = daskdata.rechunk((300, 300, 900)).persist()  # chunks passed as a single tuple

The code above works perfectly fine for a smaller array, e.g. 1000x1000x1000, so I don’t understand what the reason is here.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Aug 17, 2017

I suspect that your single large numpy array is larger than the intended memory for any of your individual processes. You might try one of the following options:

  1. Use fewer processes with more memory. Consider using different options for LocalCluster other than the default.
  2. Use just a single process with Client(processes=False).
  3. Split your large numpy array into smaller pieces before you send it to the cluster.
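Option 2 can be sketched as follows (a minimal sketch with made-up, scaled-down sizes; `processes=False` keeps the scheduler and workers inside the calling process, so scattered data is shared in memory rather than serialized between worker processes):

```python
import numpy as np
import dask.array as da
from dask.distributed import Client

# run scheduler and workers as threads in this process, so the
# array is never copied across process boundaries
client = Client(processes=False)

data = np.random.rand(100, 100, 200).astype(np.float32)  # small stand-in for the 8 GB array
x = da.from_array(data, chunks=(30, 30, 90))
y = x.rechunk((50, 50, 100)).persist()

client.close()
```

This trades away process-level parallelism, but for a workload dominated by one big in-memory array that is often the simpler fix.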

Split locally first

import dask
import dask.array as da

x = da.from_array(x, chunks=...)      # wrap the local numpy array lazily
x = x.persist(get=dask.threaded.get)  # split locally with the threaded scheduler
x = x.persist()                       # then persist the resulting chunks to the cluster

Use a cluster with a different process/thread breakdown

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=4)  # 4 processes x 4 threads each
client = Client(cluster)

Generally speaking, Dask expects any particular value to fit easily in the memory of one of your processes. With the defaults it creates as many worker processes as you have logical cores, which divides your total memory across all of them.
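To make that arithmetic concrete (the RAM figure comes from the question; the core count is hypothetical):

```python
total_memory_gb = 128   # machine RAM, from the question
logical_cores = 16      # hypothetical core count; defaults spawn one worker per core
array_gb = 8            # size of the scattered numpy array

per_worker_gb = total_memory_gb / logical_cores
# 8.0 GB per worker: the single scattered array alone fills one
# worker's entire share before rechunking even begins, so any
# copy made during the rechunk pushes that worker over its limit
```

Under these assumptions, a 128 GB machine can still hit a memory error on an 8 GB array because no single worker owns more than ~8 GB of it.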

0 reactions
mrocklin commented, Aug 18, 2017

Typically people don’t handle very large numpy arrays directly; they break them up somehow. Yes, HDF5 or NetCDF would work fine. You might also create your data in chunks. The right way to do things often depends on your application.
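One way to "create your data in chunks", assuming random data as in the original question, is to build the array lazily with dask.array, so the full-size numpy array never exists on the client:

```python
import dask.array as da

# the array is defined lazily, one (300, 300, 900) chunk at a time;
# no 8 GB numpy array is ever materialized in a single process
x = da.random.random((1000, 1000, 2000), chunks=(300, 300, 900)).astype('float32')

# any later computation (x.mean().compute(), x.persist(), ...) only
# needs to hold a few ~300 MB chunks per worker at any moment
```

This sidesteps the scatter step entirely: each worker generates only the chunks it is responsible for.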


Top Results From Across the Web

  • Dask - Rechunk or array slicing causing large memory usage?
  • Rechunker: The missing link for chunked array analytics
  • dask.array.rechunk - Dask documentation
  • Increase or decrease the number of chunks in the disk.frame
  • Parallel computing with dask — xarray 0.10.2 documentation
