LocalCUDACluster startup time scales with >1 second per GPU
See original GitHub issue

Spinning up a LocalCUDACluster using all 16 GPUs on a DGX-2 consistently takes 20+ seconds. Since the startup time scales roughly linearly with the number of GPUs used, I wonder whether some of this seemingly serial processing could be done in parallel.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

%%time
cluster = LocalCUDACluster()
client = Client(cluster)

CPU times: user 889 ms, sys: 442 ms, total: 1.33 s
Wall time: 22.4 s
%%time
cluster = LocalCUDACluster(
    n_workers=3,
    CUDA_VISIBLE_DEVICES="0,1,2",
)
client = Client(cluster)

CPU times: user 595 ms, sys: 179 ms, total: 774 ms
Wall time: 3.42 s
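For reference, a minimal sketch of how one might measure this scaling more systematically. This is an illustration, not part of the original report: the GPU counts in the loop and the use of wall-clock timing are assumptions, and it requires a machine with at least 8 GPUs and dask-cuda installed.

import time
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Time cluster startup for an increasing number of GPUs to check whether
# wall time grows roughly linearly with the number of workers.
for n in (1, 2, 4, 8):
    devices = ",".join(str(i) for i in range(n))
    start = time.time()
    cluster = LocalCUDACluster(n_workers=n, CUDA_VISIBLE_DEVICES=devices)
    client = Client(cluster)
    print(f"{n} GPUs: {time.time() - start:.1f} s startup")
    client.close()
    cluster.close()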
Issue Analytics
- Created: 4 years ago
- Comments: 23 (17 by maintainers)
Top GitHub Comments
Thanks for adding this Peter! 😀
Yeah, this loop could be parallelized using asyncio.gather instead of linearly waiting for each worker: https://github.com/dask/distributed/blob/b60c4bfc5f46a9aa2f1aacbda21eac38abcc807e/distributed/deploy/spec.py#L315-L317. Something like this should work (missing support for exception handling).
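As an illustration of the suggestion above, a minimal sketch with hypothetical helper functions (start_serial and start_parallel are not the actual SpecCluster code): awaiting the worker coroutines with asyncio.gather lets them start concurrently, so total startup cost approaches that of the slowest single worker rather than the sum over all workers.

import asyncio

# Current behaviour (roughly): workers are awaited one at a time, so the
# total startup time is the sum of the individual worker startup times.
async def start_serial(workers):
    for w in workers:
        await w

# Proposed change: await all workers concurrently with asyncio.gather, so
# startup time approaches that of the slowest single worker.
# (Exception handling omitted, as noted above.)
async def start_parallel(workers):
    await asyncio.gather(*workers)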