
LocalCudaCluster startup-time scales with >1 second per GPU

See original GitHub issue

Spinning up a LocalCUDACluster using all 16 GPUs on a DGX-2 consistently takes 20+ seconds. Given that the time scales roughly linearly with the number of GPUs used, I wonder whether some of this apparently serial processing could be done in parallel.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

%%time
cluster = LocalCUDACluster()
client = Client(cluster)

CPU times: user 889 ms, sys: 442 ms, total: 1.33 s
Wall time: 22.4 s

%%time
cluster = LocalCUDACluster(
    n_workers=3,
    CUDA_VISIBLE_DEVICES="0,1,2",
)
client = Client(cluster)

CPU times: user 595 ms, sys: 179 ms, total: 774 ms
Wall time: 3.42 s
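
Note that %%time is an IPython cell magic, so the measurement above only works in a notebook or IPython shell. Below is a minimal, self-contained sketch of the same comparison as a plain Python script; it assumes dask-cuda is installed and that device indices 0,1,2 exist on the machine, so treat the kwargs and timings as illustrative rather than a benchmark.

# Sketch: time LocalCUDACluster startup outside IPython.
import time

from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def time_cluster_startup(**cluster_kwargs):
    """Return the wall-clock seconds spent creating the cluster and client."""
    start = time.perf_counter()
    cluster = LocalCUDACluster(**cluster_kwargs)
    client = Client(cluster)
    elapsed = time.perf_counter() - start
    client.close()
    cluster.close()
    return elapsed


if __name__ == "__main__":
    # All visible GPUs (16 on a DGX-2) vs. a 3-GPU subset.
    print("all GPUs:", time_cluster_startup())
    print("3 GPUs:  ", time_cluster_startup(n_workers=3, CUDA_VISIBLE_DEVICES="0,1,2"))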

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 23 (17 by maintainers)

Top GitHub Comments

1 reaction
jakirkham commented, Jun 5, 2020

Thanks for adding this Peter! 😀

1 reaction
jcrist commented, Dec 3, 2019

Yeah, this loop could be parallelized using asyncio.gather instead of waiting for each worker sequentially: https://github.com/dask/distributed/blob/b60c4bfc5f46a9aa2f1aacbda21eac38abcc807e/distributed/deploy/spec.py#L315-L317.

Something like this should work (still missing support for exception handling):

await asyncio.gather(*workers, return_exceptions=True)
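
To illustrate the suggestion, here is a small standalone sketch of the pattern: awaiting several slow worker startups concurrently with asyncio.gather rather than one at a time. The start_worker coroutine and its sleep-based startup are hypothetical stand-ins for the real worker objects in distributed/deploy/spec.py, so only the timing pattern carries over.

# Sketch only: "workers" here are stand-in coroutines, not distributed.Worker objects.
import asyncio
import time


async def start_worker(name, startup_seconds=1.0):
    # Pretend each worker takes ~1 second of non-blocking startup work.
    await asyncio.sleep(startup_seconds)
    return name


async def sequential(names):
    # Awaiting each worker in turn: total time ~= sum of all startups.
    return [await start_worker(n) for n in names]


async def concurrent(names):
    # asyncio.gather awaits all startups at once: total time ~= slowest startup.
    # return_exceptions=True collects failures in the result list instead of raising.
    return await asyncio.gather(*(start_worker(n) for n in names), return_exceptions=True)


names = [f"gpu-{i}" for i in range(16)]

t0 = time.perf_counter()
asyncio.run(sequential(names))
print(f"sequential: {time.perf_counter() - t0:.1f}s")  # ~16s

t0 = time.perf_counter()
asyncio.run(concurrent(names))
print(f"gather:     {time.perf_counter() - t0:.1f}s")  # ~1s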
Read more comments on GitHub >

Top Results From Across the Web

  • Source code for dask_cuda.local_cuda_cluster - Dask-CUDA
    class LocalCUDACluster(LocalCluster): """A variant of ``dask.distributed. ... It creates one Dask worker process per GPU, and assigns each worker ...
  • API — dask-cuda 22.12.00a0+g2c99f5a documentation
    LocalCluster that uses one GPU per process. This assigns a different CUDA_VISIBLE_DEVICES environment variable to each Dask worker process.
  • Microsecond-scale Preemption for Concurrent GPU ... - USENIX
    This paper presents REEF, the first GPU-accelerated DNN inference serving system that enables microsecond-scale kernel preemption and controlled concurrent ...
  • Efficient MPI-based Communication for GPU-Accelerated ...
    MPI4Dask speeds up the execution time for cuPy and cuDF applications by an average of 1.71× and 2.91× respectively for 1–32 ...
  • Dask - OSCHINA (Chinese open-source technology community)
    ... plugins and add resource limits defining the number of GPUs per pod/worker. ... cluster w/ one worker per device from dask_cuda import ...
