
LocalCudaCluster startup-time scales with >1 second per GPU

See original GitHub issue

Spinning up a LocalCUDACluster using all 16 GPUs on a DGX-2 consistently takes 20+ seconds. Given that the time scales roughly linearly with the number of GPUs used, I wonder whether some of this apparently serial processing could be done in parallel.

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

%%time
cluster = LocalCUDACluster()
client = Client(cluster)

CPU times: user 889 ms, sys: 442 ms, total: 1.33 s
Wall time: 22.4 s

%%time
cluster = LocalCUDACluster(
    n_workers=3,
    CUDA_VISIBLE_DEVICES="0,1,2",
)
client = Client(cluster)

CPU times: user 595 ms, sys: 179 ms, total: 774 ms
Wall time: 3.42 s
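
Note that %%time is an IPython cell magic, so the measurement above only works in a notebook or IPython shell. Below is a minimal, self-contained sketch of the same comparison as a plain Python script; it assumes dask-cuda is installed and that device indices 0,1,2 exist on the machine, so treat the kwargs and timings as illustrative rather than a benchmark.

# Sketch: time LocalCUDACluster startup outside IPython.
import time

from dask.distributed import Client
from dask_cuda import LocalCUDACluster


def time_cluster_startup(**cluster_kwargs):
    """Return the wall-clock seconds spent creating the cluster and client."""
    start = time.perf_counter()
    cluster = LocalCUDACluster(**cluster_kwargs)
    client = Client(cluster)
    elapsed = time.perf_counter() - start
    client.close()
    cluster.close()
    return elapsed


if __name__ == "__main__":
    # All visible GPUs (16 on a DGX-2) vs. a 3-GPU subset.
    print("all GPUs:", time_cluster_startup())
    print("3 GPUs:  ", time_cluster_startup(n_workers=3, CUDA_VISIBLE_DEVICES="0,1,2"))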

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 23 (17 by maintainers)

Top GitHub Comments

1 reaction
jakirkham commented, Jun 5, 2020

Thanks for adding this Peter! 😀

1 reaction
jcrist commented, Dec 3, 2019

Yeah, this loop could be parallelized using asyncio.gather instead of waiting for each worker sequentially: https://github.com/dask/distributed/blob/b60c4bfc5f46a9aa2f1aacbda21eac38abcc807e/distributed/deploy/spec.py#L315-L317.

Something like this should work (still missing support for exception handling):

await asyncio.gather(*workers, return_exceptions=True)
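
To illustrate the suggestion, here is a small standalone sketch of the pattern: awaiting several slow worker startups concurrently with asyncio.gather rather than one at a time. The start_worker coroutine and its sleep-based startup are hypothetical stand-ins for the real worker objects in distributed/deploy/spec.py, so only the timing pattern carries over.

# Sketch only: "workers" here are stand-in coroutines, not distributed.Worker objects.
import asyncio
import time


async def start_worker(name, startup_seconds=1.0):
    # Pretend each worker takes ~1 second of non-blocking startup work.
    await asyncio.sleep(startup_seconds)
    return name


async def sequential(names):
    # Awaiting each worker in turn: total time ~= sum of all startups.
    return [await start_worker(n) for n in names]


async def concurrent(names):
    # asyncio.gather awaits all startups at once: total time ~= slowest startup.
    # return_exceptions=True collects failures in the result list instead of raising.
    return await asyncio.gather(*(start_worker(n) for n in names), return_exceptions=True)


names = [f"gpu-{i}" for i in range(16)]

t0 = time.perf_counter()
asyncio.run(sequential(names))
print(f"sequential: {time.perf_counter() - t0:.1f}s")  # ~16s

t0 = time.perf_counter()
asyncio.run(concurrent(names))
print(f"gather:     {time.perf_counter() - t0:.1f}s")  # ~1s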
Read more comments on GitHub >

Top Results From Across the Web

  • Source code for dask_cuda.local_cuda_cluster - Dask-CUDA
    class LocalCUDACluster(LocalCluster): """A variant of ``dask.distributed. ... It creates one Dask worker process per GPU, and assigns each worker ...
  • API — dask-cuda 22.12.00a0+g2c99f5a documentation
    LocalCluster that uses one GPU per process. This assigns a different CUDA_VISIBLE_DEVICES environment variable to each Dask worker process.
  • Microsecond-scale Preemption for Concurrent GPU ... - USENIX
    This paper presents REEF, the first GPU-accelerated DNN inference serving system that enables microsecond-scale kernel preemption and controlled concurrent ...
  • Efficient MPI-based Communication for GPU-Accelerated ...
    MPI4Dask speeds up the execution time for cuPy and cuDF applications by an average of 1.71× and 2.91× respectively for 1–32 ...
  • Dask - OSCHINA (Chinese open-source technology community)
    ... plugins and add resource limits defining the number of GPUs per pod/worker. ... cluster w/ one worker per device from dask_cuda import ...
