[BUG] Task state with CuPy array
Hi!
This issue started as an issue in the dask_cuda project, but it seems more appropriate here. The original issue can be found here: https://github.com/rapidsai/dask-cuda/issues/93
I am currently working on implementing some statistical estimators using dask, distributed, cupy and dask_cuda for multi-GPU support.
It all started with the observation described in the referenced issue: the task screen of the dashboard reports very high and unexpected transfer-* costs when performing some simple operations on the GPU. Please see the original issue for a screenshot of the task screen.
After exploring the execution via a custom SchedulerPlugin (this is a great feature!), I realized that these very high transfer costs (>500 ms) didn't make sense: they were associated with gather actions on tasks gathering chunks smaller than 1 kB. See the attached screenshot:

After some more debugging, I realized that the reason for this is that cupy operations are asynchronous (you can find an explanation and an example in the original issue). Since distributed is not aware of this, it is misled into measuring very low compute-* times and very high transfer-* times: serialization requires a synchronous copy from the GPU to the CPU, which blocks until the computation is finished and the data is copied.
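To illustrate the effect, here is a minimal, self-contained sketch (not taken from the original issue; the array size and timings are purely illustrative):

```python
import time
import cupy as cp

x = cp.random.random((10000, 10000))

# The matmul is only *launched* here; the call returns before the kernel
# finishes, so the measured "compute" time looks tiny.
t0 = time.perf_counter()
y = x @ x
print(f"kernel launch (async): {time.perf_counter() - t0:.4f} s")

# Copying to the host (which is effectively what serialization does)
# blocks until the kernel has completed, so the wait shows up here instead.
t0 = time.perf_counter()
y_host = cp.asnumpy(y)
print(f"copy to host (blocks on compute): {time.perf_counter() - t0:.4f} s")
```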
I know that distributed and cupy are two very separate libraries, but this clearly seems to be a bug, or at least a misleading situation for people trying to combine cupy and distributed.
So, I would be very happy to work on a fix for this (on the cupy side, it would be something like adding cupy_array.device.synchronize(), see https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.Device.html#cupy.cuda.Device.synchronize ), but I'm not sure whether you would be interested in adding custom support for cupy, or whether you would prefer to have (or already have) a generic solution for handling such situations.
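For reference, one possible shape of such a fix on the user side (a hypothetical sketch, not an existing distributed or cupy API) would be to synchronize the device before a task's result is handed back, so that compute-* times include the actual kernel execution:

```python
import functools
import cupy as cp

def synchronized(task):
    """Hypothetical helper: wrap a task that returns a CuPy array so the
    device is synchronized before the result leaves the task."""
    @functools.wraps(task)
    def wrapper(*args, **kwargs):
        result = task(*args, **kwargs)
        if isinstance(result, cp.ndarray):
            # cupy.cuda.Device.synchronize() blocks until all queued work
            # on the array's device has finished.
            result.device.synchronize()
        return result
    return wrapper
```

Such a wrapper could then be applied to GPU tasks submitted to the cluster (e.g. `client.submit(synchronized(my_gpu_function), ...)`); a generic solution inside distributed would presumably hook into the same place, just before the task's stop time is recorded.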
Thanks!
NOTE: this is somewhat separate, but the attached screenshot was constructed from a mix of the graph computed by dask and the execution trace obtained with a SchedulerPlugin. Would a cleaner / nicer version of this be interesting to a more general audience?
Edit: Added link to cupy docs on Device.

Thanks for working on this @matthieubulte!
I agree. I was also thinking of preparing a suite of performance tests to understand cupy with/without sync and with different sync granularities, and then running the same tests on dask distributed with different optimization levels and maybe even different numbers of GPUs.
I think it would also be interesting to see how different the scheduling will be under these regimes.
Agreed, I think the tests will help us understand this trade-off.