[BUG] Task state with CuPy array
Hi!
This issue started as an issue in the `dask_cuda` project, but it seems more appropriate here. The original issue can be found at https://github.com/rapidsai/dask-cuda/issues/93.
I am currently working on implementing some statistical estimators using `dask`, `distributed`, `cupy`, and `dask_cuda` for multi-GPU support.
My issue started with the observation described in the referenced issue: the task screen of the dashboard indicates very high and unintended `transfer-*` costs when performing some simple operations on the GPU. Please see the initial issue for a screenshot of the task screen.
After some exploration of the execution via a custom `SchedulerPlugin` (this is a great feature!), I realized that these very high transfer costs (>500 ms) didn't make sense: they were associated with `gather` actions on tasks gathering chunks smaller than 1 kB. See:
After some more debugging, I realized that the reason for this is that `cupy` operations are asynchronous. You can find an explanation and example in the original issue. However, since `distributed` is not aware of this, the library is misled into measuring very low `compute-*` times and very high `transfer-*` times: serialization requires a synchronous copy from the GPU to the CPU, which blocks until the computation is over and the data is copied.
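To make the mismeasurement concrete without needing a GPU, here is a pure-Python analogy (a sketch, not actual `cupy` code): a background thread stands in for the asynchronous GPU kernel, so timing the "launch" measures almost nothing while the blocking "serialization" step absorbs the whole compute time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def gpu_kernel():
    # Stand-in for a GPU kernel that takes a noticeable amount of time.
    time.sleep(0.2)
    return 42

# "Launching" the kernel returns immediately, like an async cupy op,
# so the measured compute time is close to zero.
start = time.perf_counter()
future = executor.submit(gpu_kernel)
compute_time = time.perf_counter() - start

# Fetching the result blocks until the kernel finishes, so the compute
# cost shows up here instead -- the same mismeasurement distributed
# makes when serialization forces a device-to-host copy.
start = time.perf_counter()
result = future.result()
transfer_time = time.perf_counter() - start

print(compute_time, transfer_time)
```

With real `cupy` arrays the effect is the same: the kernel launch returns before the work is done, and the synchronous copy during serialization pays for it.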
I know that `distributed` and `cupy` are two very separate libraries, but this clearly seems to be a bug, or at least a misleading situation for people trying to combine `cupy` and `distributed`.
So, I would be very happy to work on a fix for this (on the `cupy` side, it would be something like adding `cupy_array.device.synchronize()`, see https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.Device.html#cupy.cuda.Device.synchronize), but I'm not sure whether you would be interested in adding custom support for `cupy`, or whether you would prefer to have (or already have) some generic solution for handling such situations.
Thanks!
NOTE: this is somewhat separate, but the attached screenshot was constructed via a mix of the graph computed by dask and the execution trace obtained with a `SchedulerPlugin`. Would a cleaner / nicer version of this be interesting to a more general audience?
Edit: Added link to cupy docs on `Device`.
Issue Analytics
- State:
- Created: 4 years ago
- Comments: 19 (19 by maintainers)
Top GitHub Comments
Thanks for working on this @matthieubulte!
I agree; I was also thinking of preparing a suite of performance tests to understand cupy with / without sync and with different sync granularities, then the same tests running on dask distributed with different optimization levels and maybe even different numbers of GPUs.
I think it might also be interesting to see how different the scheduling will be under these different regimes.
Agreed, I think the tests will help us understand this trade-off.