[SGD] DCGAN example fails on Azure cluster
What is the problem?
Ray SGD training operator initialization fails during DistributedDataParallel wrapping on an Azure cluster when the NCCL backend is used (the Gloo backend works fine).
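As a temporary workaround, the example can be forced onto the Gloo backend, which avoids the failing NCCL initialization. A minimal sketch, assuming the `backend` argument of Ray SGD's `TorchTrainer` in Ray 1.0 (the `GANOperator` name is illustrative; in this repro it would be the training operator defined in `dcgan.py`):

```python
import ray
from ray.util.sgd.torch import TorchTrainer

# Illustrative import: the operator class defined in the example script.
from dcgan import GANOperator

ray.init(address="auto")

trainer = TorchTrainer(
    training_operator_cls=GANOperator,
    num_workers=16,
    use_gpu=True,
    backend="gloo",  # force Gloo; with use_gpu=True the "auto" backend picks NCCL, which fails here
)
```

The full traceback and NCCL debug output with the default (NCCL) backend follow.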
(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO Ring 00 : 9[2] -> 7[3] via P2P/IPC
Traceback (most recent call last):
  File "lf/models/dcgan/dcgan.py", line 311, in <module>
    test_mode=args.smoke_test)
  File "lf/models/dcgan/dcgan.py", line 257, in train_example
    use_tqdm=True)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/torch_trainer.py", line 262, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/torch_trainer.py", line 323, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/worker_group.py", line 215, in start_workers
    ray.get(self._setup_operator())
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/worker.py", line 1428, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::DistributedTorchRunner.setup_operator() (pid=8000, ip=10.221.0.4)
  File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task.function_executor
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 85, in setup_operator
    scheduler_step_freq=self.scheduler_step_freq)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/training_operator.py", line 149, in __init__
    self.setup(config)
  File "lf/models/dcgan/dcgan.py", line 143, in setup
    models=models, optimizers=optimizers, criterion=nn.BCELoss())
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/training_operator.py", line 268, in register
    for model in self._original_models
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/training_operator.py", line 268, in <listcomp>
    for model in self._original_models
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
    self.broadcast_bucket_size)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8
(pid=17940, ip=10.221.0.5) ray-default-worker-5ee698a00:17940:18696 [0] NCCL INFO Ring 01 : 3[6] -> 5[5] via P2P/IPC
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO Ring 01 : 11 -> 2 [receive] via NET/IB/0
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO Ring 01 : 2[7] -> 3[6] via P2P/IPC
(pid=17949, ip=10.221.0.5)
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO transport/net_ib.cc:601 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO include/net.h:24 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO transport/net.cc:388 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO init.cc:679 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO init.cc:815 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO init.cc:951 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
(pid=17937, ip=10.221.0.5) ray-default-worker-5ee698a00:17937:18695 [0] NCCL INFO Ring 01 : 5[5] -> 8[4] via P2P/IPC
(pid=17944, ip=10.221.0.5) ray-default-worker-5ee698a00:17944:18693 [0] NCCL INFO Ring 01 : 8[4] -> 12[3] via P2P/IPC
(pid=17936, ip=10.221.0.5) ray-default-worker-5ee698a00:17936:18699 [0] NCCL INFO Ring 01 : 15[0] -> 14[1] via P2P/IPC
(pid=17936, ip=10.221.0.5) ray-default-worker-5ee698a00:17936:18699 [0] NCCL INFO Trees [0] 14->15->-1/-1/-1 [1] 14->15->-1/-1/-1
(pid=17939, ip=10.221.0.5) ray-default-worker-5ee698a00:17939:18697 [0] NCCL INFO Ring 01 : 12[3] -> 13[2] via P2P/IPC
(pid=17938, ip=10.221.0.5) ray-default-worker-5ee698a00:17938:18688 [0] NCCL INFO Ring 01 : 13[2] -> 14[1] via P2P/IPC
(pid=17934, ip=10.221.0.5) ray-default-worker-5ee698a00:17934:18687 [0] NCCL INFO Ring 01 : 14[1] -> 15[0] via P2P/IPC
(pid=7999) ray-default-head-8f920c480:7999:8901 [0] NCCL INFO Ring 01 : 11[0] -> 10[1] via P2P/IPC
(pid=7999) ray-default-head-8f920c480:7999:8901 [0] NCCL INFO Trees [0] 10->11->-1/-1/-1 [1] 10->11->-1/-1/-1
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO Ring 01 : 0[7] -> 1[6] via P2P/IPC
(pid=8000)
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO transport/net_ib.cc:601 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO include/net.h:24 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO transport/net.cc:388 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO init.cc:679 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO init.cc:815 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO init.cc:951 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
(pid=8007) ray-default-head-8f920c480:8007:8903 [0] NCCL INFO Ring 01 : 4[5] -> 6[4] via P2P/IPC
(pid=7991) ray-default-head-8f920c480:7991:8899 [0] NCCL INFO Ring 01 : 6[4] -> 7[3] via P2P/IPC
(pid=7951) ray-default-head-8f920c480:7951:8893 [0] NCCL INFO Ring 01 : 7[3] -> 9[2] via P2P/IPC
(pid=7927) ray-default-head-8f920c480:7927:8905 [0] NCCL INFO Ring 01 : 10[1] -> 9[2] via P2P/IPC
(pid=8005) ray-default-head-8f920c480:8005:8895 [0] NCCL INFO Ring 01 : 1[6] -> 4[5] via P2P/IPC
(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO Ring 01 : 9[2] -> 10[1] via P2P/IPC
(pid=17940, ip=10.221.0.5) ray-default-worker-5ee698a00:17940:18696 [0] NCCL INFO Ring 01 : 3[6] -> 2[7] via P2P/IPC
(pid=17937, ip=10.221.0.5) ray-default-worker-5ee698a00:17937:18695 [0] NCCL INFO Ring 01 : 5[5] -> 3[6] via P2P/IPC
(pid=17937, ip=10.221.0.5) ray-default-worker-5ee698a00:17937:18695 [0] NCCL INFO Trees [0] 3->5->8/-1/-1 [1] 3->5->8/-1/-1
(pid=17944, ip=10.221.0.5) ray-default-worker-5ee698a00:17944:18693 [0] NCCL INFO Ring 01 : 8[4] -> 5[5] via P2P/IPC
(pid=17944, ip=10.221.0.5) ray-default-worker-5ee698a00:17944:18693 [0] NCCL INFO Trees [0] 5->8->12/-1/-1 [1] 5->8->12/-1/-1
(pid=17944, ip=10.221.0.5) ray-default-worker-5ee698a00:17944:18693 [0] NCCL INFO comm 0x7edaac0028a0 rank 8 nranks 16 cudaDev 0 nvmlDev 4 - Init COMPLETE
(pid=17936, ip=10.221.0.5) ray-default-worker-5ee698a00:17936:18699 [0] NCCL INFO comm 0x7f90f40028a0 rank 15 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
(pid=17939, ip=10.221.0.5) ray-default-worker-5ee698a00:17939:18697 [0] NCCL INFO Ring 01 : 12[3] -> 8[4] via P2P/IPC
(pid=17939, ip=10.221.0.5) ray-default-worker-5ee698a00:17939:18697 [0] NCCL INFO Trees [0] 8->12->13/-1/-1 [1] 8->12->13/-1/-1
(pid=17939, ip=10.221.0.5) ray-default-worker-5ee698a00:17939:18697 [0] NCCL INFO comm 0x7f6c240028a0 rank 12 nranks 16 cudaDev 0 nvmlDev 3 - Init COMPLETE
(pid=17938, ip=10.221.0.5) ray-default-worker-5ee698a00:17938:18688 [0] NCCL INFO Ring 01 : 13[2] -> 12[3] via P2P/IPC
(pid=17938, ip=10.221.0.5) ray-default-worker-5ee698a00:17938:18688 [0] NCCL INFO Trees [0] 12->13->14/-1/-1 [1] 12->13->14/-1/-1
(pid=17938, ip=10.221.0.5) ray-default-worker-5ee698a00:17938:18688 [0] NCCL INFO comm 0x7f7d080028a0 rank 13 nranks 16 cudaDev 0 nvmlDev 2 - Init COMPLETE
(pid=17934, ip=10.221.0.5) ray-default-worker-5ee698a00:17934:18687 [0] NCCL INFO Ring 01 : 14[1] -> 13[2] via P2P/IPC
(pid=17934, ip=10.221.0.5) ray-default-worker-5ee698a00:17934:18687 [0] NCCL INFO Trees [0] 13->14->15/-1/-1 [1] 13->14->15/-1/-1
(pid=17934, ip=10.221.0.5) ray-default-worker-5ee698a00:17934:18687 [0] NCCL INFO comm 0x7f8f8c0028a0 rank 14 nranks 16 cudaDev 0 nvmlDev 1 - Init COMPLETE
(pid=17937, ip=10.221.0.5) ray-default-worker-5ee698a00:17937:18695 [0] NCCL INFO comm 0x7ed1700028a0 rank 5 nranks 16 cudaDev 0 nvmlDev 5 - Init COMPLETE
(pid=7999) ray-default-head-8f920c480:7999:8901 [0] NCCL INFO comm 0x7f7e440028a0 rank 11 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
(pid=8007) ray-default-head-8f920c480:8007:8903 [0] NCCL INFO Ring 01 : 4[5] -> 1[6] via P2P/IPC
(pid=8007) ray-default-head-8f920c480:8007:8903 [0] NCCL INFO Trees [0] 1->4->6/-1/-1 [1] 1->4->6/-1/-1
(pid=8007) ray-default-head-8f920c480:8007:8903 [0] NCCL INFO comm 0x7f247c0028a0 rank 4 nranks 16 cudaDev 0 nvmlDev 5 - Init COMPLETE
(pid=7991) ray-default-head-8f920c480:7991:8899 [0] NCCL INFO Ring 01 : 6[4] -> 4[5] via P2P/IPC
(pid=7991) ray-default-head-8f920c480:7991:8899 [0] NCCL INFO Trees [0] 4->6->7/-1/-1 [1] 4->6->7/-1/-1
(pid=7991) ray-default-head-8f920c480:7991:8899 [0] NCCL INFO comm 0x7efed00028a0 rank 6 nranks 16 cudaDev 0 nvmlDev 4 - Init COMPLETE
(pid=7951) ray-default-head-8f920c480:7951:8893 [0] NCCL INFO Ring 01 : 7[3] -> 6[4] via P2P/IPC
(pid=7951) ray-default-head-8f920c480:7951:8893 [0] NCCL INFO Trees [0] 6->7->9/-1/-1 [1] 6->7->9/-1/-1
(pid=7951) ray-default-head-8f920c480:7951:8893 [0] NCCL INFO comm 0x7ecd5c0028a0 rank 7 nranks 16 cudaDev 0 nvmlDev 3 - Init COMPLETE
(pid=7927) ray-default-head-8f920c480:7927:8905 [0] NCCL INFO Trees [0] 9->10->11/-1/-1 [1] 9->10->11/-1/-1
(pid=7927) ray-default-head-8f920c480:7927:8905 [0] NCCL INFO comm 0x7ed0fc0028a0 rank 10 nranks 16 cudaDev 0 nvmlDev 1 - Init COMPLETE
(pid=8005) ray-default-head-8f920c480:8005:8895 [0] NCCL INFO Ring 01 : 1[6] -> 0[7] via P2P/IPC
(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO Ring 01 : 9[2] -> 7[3] via P2P/IPC
(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO Trees [0] 7->9->10/-1/-1 [1] 7->9->10/-1/-1
(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO comm 0x7f60b40028a0 rank 9 nranks 16 cudaDev 0 nvmlDev 2 - Init COMPLETE
Cluster information:
vmSize: Standard_ND40rs_v2
# List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
imagePublisher: microsoft-dsvm
imageOffer: ubuntu-1804
imageSku: "1804-gen2"
imageVersion: 20.07.06
python dcgan.py --use-gpu --num-workers 16 --address=auto
Looks like it’s an NCCL error with that specific Azure instance type.
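The `Call to ibv_reg_mr failed` warnings on both nodes suggest that NCCL's InfiniBand transport cannot register memory (Standard_ND40rs_v2 instances expose InfiniBand); `ibv_reg_mr` failures are commonly caused by a too-low locked-memory limit (`ulimit -l`) in the environment running the workers. A minimal sketch of a possible workaround, assuming the standard NCCL environment variables: disable the InfiniBand transport so NCCL falls back to TCP sockets. The variables have to be visible to every Ray worker process, e.g., exported on each node before `ray start`; the Python form below only illustrates the names and values:

```python
# Sketch only: make NCCL skip the failing InfiniBand path and use sockets instead.
# These are standard NCCL environment variables; they must be set in the
# environment of the Ray worker processes on every node (setting them only in
# the driver is generally not enough).
import os

os.environ["NCCL_IB_DISABLE"] = "1"        # bypass the failing ibv_reg_mr path by disabling IB transport
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # assumption: eth0 is the nodes' TCP interface
os.environ["NCCL_DEBUG"] = "INFO"          # keep verbose NCCL logging while debugging
```

Raising the locked-memory limit (`ulimit -l unlimited`) for the Ray worker processes is the other direction to try if InfiniBand should stay enabled.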
Ray version and other system information (Python version, TensorFlow version, OS): Ray 1.0, Python 3.6, Ubuntu 18.04 (microsoft-dsvm image).
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.

PyTorch version 1.6, NCCL backend.
@andrijazz, is this still a problem with Ray Train? We’ve made various improvements here to avoid NCCL issues, particularly with the setting of CUDA_VISIBLE_DEVICES for Ray actors.
I am going to close this for now, but please re-open if there are still issues with the latest Ray Train.
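For anyone hitting this on a current release, a minimal sketch of the equivalent setup on the newer Ray Train API (assuming a recent Ray 2.x; the model and training loop are placeholders, not the original DCGAN code):

```python
import torch.nn as nn

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Placeholder model; the real example would build the DCGAN generator and
    # discriminator here.
    model = nn.Linear(10, 1)
    # prepare_model wraps the model in DistributedDataParallel and moves it to
    # the GPU that Ray Train assigned to this worker (CUDA_VISIBLE_DEVICES is
    # set per actor, which is one of the improvements mentioned above).
    model = ray.train.torch.prepare_model(model)
    # ... training loop ...


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
result = trainer.fit()
```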