Memory issue for rechunk
Hi,
I have a large NumPy array (1000x1000x2000). When I try to rechunk it into pieces with dask.distributed, memory keeps increasing and eventually raises a MemoryError. My machine has 128 GB of RAM while the array is only about 8 GB, so I am not clear what is going on here. My code is as follows:
import numpy as np
import dask.array as da
from dask.distributed import Client

client = Client()
data = np.random.rand(1000, 1000, 2000).astype(np.float32)
future = client.scatter(data)
daskdata = da.from_delayed(future, shape=data.shape, dtype=np.float32)
# rechunk takes the new chunk shape as a single tuple
daskdata2 = daskdata.rechunk((300, 300, 900)).persist()
The code above works perfectly fine for a smaller array, e.g. 1000x1000x1000, so I don’t understand what the reason is here.
Issue Analytics
- Created 6 years ago
- Comments: 5 (4 by maintainers)
Top Results From Across the Web
Dask - Rechunk or array slicing causing large memory usage?
I was looking for some help with understanding some excessive (or possibly not) memory usage in my Dask processing chain. The problem comes …

Rechunker: The missing link for chunked array analytics
One existing solution is to use Dask's rechunk function to create a new chunk structure lazily, on the fly, in memory. This works …

dask.array.rechunk - Dask documentation
The rechunk module defines intersect_chunks, a function for converting chunks to a new chunk structure …

Increase or decrease the number of chunks in the disk.frame
rechunk: Increase or decrease the number of chunks in the disk.frame. In disk.frame: Larger-than-RAM Disk-Based Data Manipulation Framework …

Parallel computing with dask — xarray 0.10.2 documentation
At that point, data is loaded into memory and computation proceeds in a … To fix, rechunk into a single dask array chunk …
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I suspect that your single large numpy array is larger than the intended memory for any of your individual processes. You might try one of the following options:
- Client(processes=False)
- Split locally first
- Use a cluster with a different process/thread breakdown
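The first and third options could be sketched as follows (the worker and thread counts below are assumptions for illustration, not values from the issue):

```python
from dask.distributed import Client, LocalCluster

# Option 1: a threads-only client. All workers live in one process and
# share one memory space, so the scattered 8 GB array is never copied
# between processes.
client = Client(processes=False)
client.close()

# Option 3: fewer processes, each with more threads (counts assumed here).
# Each process then gets a larger slice of the machine's RAM.
cluster = LocalCluster(n_workers=2, threads_per_worker=4)
client = Client(cluster)
client.close()
cluster.close()
```

Which breakdown is better depends on the workload: threads share memory but contend on the GIL for pure-Python code, while processes isolate memory per worker.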
Generally speaking, Dask expects any particular value to fit easily in the memory of one of your processes. If you use the defaults, it creates as many processes as you have logical cores, which divides your memory space among them.
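A quick back-of-the-envelope check of that arithmetic (the worker count below is an assumption for a typical many-core machine, not a value from the issue):

```python
# float32 = 4 bytes per element
array_bytes = 1000 * 1000 * 2000 * 4
print(array_bytes / 2**30)   # ~7.45 GiB for the full array

# If the default Client() starts, say, 16 worker processes on a 128 GB
# machine (an assumption), each process is budgeted roughly:
per_process_gb = 128 / 16    # 8 GB per process
```

A single ~7.45 GiB value leaves almost no headroom in an 8 GB per-process budget once rechunking starts making copies of the data.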
Typically people don’t handle very large NumPy arrays directly; they break them up somehow. Yes, HDF5 or NetCDF would work fine. You might also create your data in chunks. The right way to do things often depends on your application.
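For the random test data in the question, "create your data in chunks" could look like this (a sketch that only applies when the data really can be generated chunk by chunk, as random data can):

```python
import dask.array as da

# Build the array as a chunked dask collection from the start: each
# 300x300x900 float32 chunk is ~309 MiB, and no single task ever
# materializes the full 8 GB array, so no scatter step is needed.
data = da.random.random((1000, 1000, 2000), chunks=(300, 300, 900)).astype("float32")
print(data.npartitions)  # 4 * 4 * 3 = 48 chunks
```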