Chunk/memory management for CuPy-backed arrays
What happened: I’m trying to understand what the best approach to chunking is for doing Dask operations on CuPy-backed arrays, and whether the current behaviour is expected or a bug. The calculation below is much slower on the GPU than on the CPU.
To test this you’ll need a GPU and an appropriate installation of CuPy. I am using a GTX 1080 Ti with 11 GB of memory and CuPy 9.0.0.
The following is a simple example where I try to use the GPU to sum a larger-than-memory array, with a chunk size of 1 GB:
import cupy as cp
import numpy as np
import dask.array as da
huge_array = da.ones(
    (5000, 5000, 200),
    chunks=(5000, 5000, 5),
    dtype=float,
)
huge_array.nbytes / 1e9 # 40 GB in size
np.prod(huge_array.chunksize, dtype=float) * huge_array.dtype.itemsize / 1e9 # chunk size of 1 GB
huge_array = huge_array.map_blocks(cp.asarray) # make it a CuPy-backed array
array_sum = da.sum(huge_array)
array_sum.compute()
Upon compute, I get the following warning after a little while:
C:\Users\thomasaar\Miniconda3\envs\gpu2\lib\site-packages\cupy\_creation\from_data.py:66: PerformanceWarning: Using synchronous transfer as pinned memory (1000000000 bytes) could not be allocated. This generally occurs because of insufficient host memory. The original error was: cudaErrorMemoryAllocation: out of memory
  return _core.array(a, dtype, False, order)
What you expected to happen:
I expected the example above to run without memory errors. The chunks of the ones array should be discardable once they have been summed, so IMO we shouldn’t end up in a situation where the 1 GB complained about in the warning above can’t be allocated.
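One way to narrow down whether memory is genuinely exhausted or merely cached would be to inspect CuPy’s memory pools while this runs: CuPy pools both device memory and pinned host memory by default, so blocks that Dask has already released may still be held by the pools rather than returned to the driver or the OS. A small sketch using CuPy’s pool introspection (the print calls are just illustrative):

import cupy as cp

# Device memory pool: bytes currently in use vs. bytes CuPy keeps cached.
dev_pool = cp.get_default_memory_pool()
print(dev_pool.used_bytes(), dev_pool.total_bytes())

# Pinned host memory pool: the staging buffers the PerformanceWarning above
# complains about are allocated from here.
pinned_pool = cp.get_default_pinned_memory_pool()
print(pinned_pool.n_free_blocks())

# Cached-but-unused blocks can be handed back explicitly.
dev_pool.free_all_blocks()
pinned_pool.free_all_blocks()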
Some thoughts
Does Dask create the ones chunks on the GPU, or does it create them on the CPU first and then copy them to the GPU? Could either of these be the slow step? Or is CuPy not freeing up memory when it should?
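For what it’s worth, the two-step pattern in the example (da.ones building NumPy chunks on the host, then map_blocks(cp.asarray) copying each one to the device) is what needs the pinned staging buffers mentioned in the warning. A rough sketch of creating the chunks on the GPU in the first place, assuming a Dask/CuPy combination recent enough to support NEP-35’s like= argument for array creation:

import cupy as cp
import dask.array as da

# Each chunk is created directly on the GPU, so no NumPy chunk is
# materialised on the host and no pinned host buffer is needed for a copy.
huge_array = da.ones(
    (5000, 5000, 200),
    chunks=(5000, 5000, 5),
    dtype=float,
    like=cp.empty(()),  # NEP-35 dispatch: chunks become cupy.ndarray
)

array_sum = da.sum(huge_array)
array_sum.compute()

If like= isn’t available in your versions, mapping a function that calls cp.ones over the chunks also avoids the host-to-device copy, although the NumPy chunks are then still briefly created on the host.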
Environment:
- Dask version: 2021.04.1
- Python version: 3.9.4
- Operating System: Windows 10
- Install method (conda, pip, source): conda for all packages.
Comments (7, 5 by maintainers)
To add to @quasiben’s answer, we have a setup_memory_pool function in the Dask-CUDA benchmarks to do exactly that.
@astrophysaxist unfortunately, we don’t really have a good example.
It can be controlled with a combination of Dask and CuPy: https://docs.dask.org/en/latest/configuration-reference.html#rmm on the Dask side, plus CuPy’s allocator configuration.
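For what it’s worth, a rough sketch of what such a per-worker setup can look like, assuming RMM is installed; the helper below only mirrors the idea of the benchmark helper and is not the actual Dask-CUDA code, and the pool size is an arbitrary example:

import cupy as cp
import rmm
from dask.distributed import Client

def setup_memory_pool(pool_size=None):
    # Create an RMM pool on this worker and route CuPy allocations through
    # it, so device memory is reused instead of being allocated and freed
    # through the CUDA driver for every chunk.
    rmm.reinitialize(pool_allocator=True, initial_pool_size=pool_size)
    cp.cuda.set_allocator(rmm.rmm_cupy_allocator)

client = Client()                             # or Client("scheduler-address:8786")
client.run(setup_memory_pool, 4_000_000_000)  # run once on every worker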
Or with Dask-CUDA directly:
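For example (a minimal sketch; rmm_pool_size is a LocalCUDACluster argument, the "8GB" value is arbitrary, and depending on the Dask-CUDA version you may still need to point CuPy’s allocator at RMM as in the sketch above):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per GPU, each with an RMM memory pool pre-allocated, so that
# allocations are served from the pool rather than the CUDA driver.
cluster = LocalCUDACluster(rmm_pool_size="8GB")
client = Client(cluster)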
Can I ask you to file an issue on the Dask-CUDA issue tracker (https://github.com/rapidsai/dask-cuda/)? There we can work out the best way to document how to use RMM with Dask.