
[BUG] Task state with CuPy array

See original GitHub issue

Hi!

This started as an issue in the dask_cuda project, but it seems more appropriate here. The original issue can be found at https://github.com/rapidsai/dask-cuda/issues/93

I am currently working on implementing some statistical estimators using dask, distributed, cupy and dask_cuda for multi-GPU support.

It all started with the observation described in the referenced issue: the task screen of the dashboard indicates very high and unintended transfer-* costs when performing some simple operations on the GPU. Please see the original issue for a screenshot of the task screen.

After exploring the execution via a custom SchedulerPlugin (this is a great feature!), I realized that these very high transfer costs (>500 ms) didn't make sense: they were associated with gather actions on tasks gathering chunks smaller than 1 kB (see the attached screenshot).
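For context, here is a minimal sketch of the kind of transition logging a SchedulerPlugin allows; it is illustrative only, not the exact plugin behind the screenshot, and the TransitionLogger name is made up. It assumes the plugin is registered on the scheduler process, e.g. via a preload script.

```python
import time
from distributed.diagnostics.plugin import SchedulerPlugin

class TransitionLogger(SchedulerPlugin):
    """Record every task state change seen by the scheduler."""

    def __init__(self):
        self.events = []  # (wall-clock time, task key, start state, finish state)

    def transition(self, key, start, finish, *args, **kwargs):
        # Called by the scheduler on every transition, e.g. processing -> memory.
        self.events.append((time.time(), key, start, finish))

# Example preload script for `dask-scheduler --preload trace_plugin.py`:
# def dask_setup(scheduler):
#     scheduler.add_plugin(TransitionLogger())
```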

After some more debugging, I realized that the reason is that cupy operations are asynchronous (you can find an explanation and example in the original issue). Since distributed is not aware of this, it is misled into measuring very low compute-* times and very high transfer-* times: serialization requires a synchronous copy from the GPU to the CPU, which blocks until the computation is over and the data has been copied.
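To make the effect concrete, here is a minimal standalone sketch (not taken from the original issue; the array size is arbitrary) showing that host-side timing of a CuPy operation mostly measures the kernel launch, while the device-to-host copy absorbs the real compute time:

```python
import time
import cupy

x = cupy.random.random((4000, 4000))

t0 = time.perf_counter()
y = x @ x                    # kernel is queued; the call returns almost immediately
t1 = time.perf_counter()
y_host = cupy.asnumpy(y)     # synchronous copy: blocks until the matmul has finished
t2 = time.perf_counter()

print(f"apparent compute time: {t1 - t0:.4f} s")           # misleadingly small
print(f"copy time (incl. pending work): {t2 - t1:.4f} s")   # absorbs the real cost
```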

I know that distributed and cupy are two very separate libraries, but this clearly seems to be a bug, or at least a misleading situation for people trying to combine cupy and distributed.

So, I would be very happy to work on a fix for this (on the cupy side, it would be something like adding a call to cupy_array.device.synchronize(), see https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.cuda.Device.html#cupy.cuda.Device.synchronize), but I'm not sure whether you would be interested in adding custom support for cupy, or whether you would prefer to have (or already have) some generic solution for handling such situations.
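As a rough illustration of that direction (not an agreed-upon fix; `synced` is a hypothetical helper, and whether to synchronize per task or inside the serializer is exactly the open question), one could force the device to finish before a result leaves a task:

```python
import cupy

def synced(func):
    """Wrap a task so queued GPU work is finished before the result is returned."""
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        if isinstance(result, cupy.ndarray):
            result.device.synchronize()  # cupy.cuda.Device.synchronize()
        return result
    return wrapper

# Usage, e.g. with a dask array:
#   y = x.map_blocks(synced(my_gpu_kernel))
```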

Thanks!

NOTE: this is somewhat separate, but the attached screenshot was constructed by combining the graph computed by dask with the execution trace obtained from a SchedulerPlugin. Would a cleaner/nicer version of this be interesting to a more general audience?

Edit: Added link to cupy docs on Device.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 19 (19 by maintainers)

Top GitHub Comments

1 reaction
pentschev commented, Jul 15, 2019

Thanks for working on this @matthieubulte!

1 reaction
matthieubulte commented, Jul 15, 2019

I think this is something that should be benchmarked to verify how performance is affected.

I agree, I was also thinking of preparing a suite of performance tests to understand cupy with/without sync and with different sync granularities, then running the same tests on dask.distributed with different optimization levels and maybe even different numbers of GPUs.

I think it might also be interesting to see how different the scheduling will be under these regimes.

In general, we should have as few synchronization calls as possible, so it would be good if there's a configuration option to enable/disable synchronization.

Agree, I think the test will help understand this trade-off.
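A rough sketch of the kind of micro-benchmark discussed in this comment (the workload and sizes are placeholders; real tests would also vary chunk sizes, optimization levels, and the number of GPUs) might compare per-call synchronization against a single synchronization at the end:

```python
import time
import cupy

def bench(sync_every_call, n_iter=50, size=4000):
    x = cupy.random.random((size, size))
    cupy.cuda.Device().synchronize()              # start from an idle GPU
    t0 = time.perf_counter()
    for _ in range(n_iter):
        y = x @ x
        if sync_every_call:
            cupy.cuda.Device().synchronize()      # pay the sync cost on every call
    cupy.cuda.Device().synchronize()              # drain any queued work either way
    return (time.perf_counter() - t0) / n_iter

print("per-call sync:   ", bench(sync_every_call=True))
print("sync at the end: ", bench(sync_every_call=False))
```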
