
[SGD] DCGAN example fails on Azure cluster

See original GitHub issue

What is the problem?

Ray SGD's training operator initialization fails during DistributedDataParallel wrapping on an Azure cluster when using the NCCL backend (the Gloo backend works fine).

(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO Ring 00 : 9[2] -> 7[3] via P2P/IPC
Traceback (most recent call last):
  File "lf/models/dcgan/dcgan.py", line 311, in <module>
    test_mode=args.smoke_test)
  File "lf/models/dcgan/dcgan.py", line 257, in train_example
    use_tqdm=True)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/torch_trainer.py", line 262, in __init__
    self._start_workers(self.max_replicas)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/torch_trainer.py", line 323, in _start_workers
    self.worker_group.start_workers(num_workers)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/worker_group.py", line 215, in start_workers
    ray.get(self._setup_operator())
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/worker.py", line 1428, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::DistributedTorchRunner.setup_operator() (pid=8000, ip=10.221.0.4)
  File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 438, in ray._raylet.execute_task.function_executor
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/distributed_torch_runner.py", line 85, in setup_operator
    scheduler_step_freq=self.scheduler_step_freq)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/training_operator.py", line 149, in __init__
    self.setup(config)
  File "lf/models/dcgan/dcgan.py", line 143, in setup
    models=models, optimizers=optimizers, criterion=nn.BCELoss())
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/training_operator.py", line 268, in register
    for model in self._original_models
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/ray/util/sgd/torch/training_operator.py", line 268, in <listcomp>
    for model in self._original_models
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 333, in __init__
    self.broadcast_bucket_size)
  File "/home/ubuntu/.local/share/virtualenvs/trendage-UlsgU2xo/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 549, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8
(pid=17940, ip=10.221.0.5) ray-default-worker-5ee698a00:17940:18696 [0] NCCL INFO Ring 01 : 3[6] -> 5[5] via P2P/IPC
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO Ring 01 : 11 -> 2 [receive] via NET/IB/0
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO Ring 01 : 2[7] -> 3[6] via P2P/IPC
(pid=17949, ip=10.221.0.5) 
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO transport/net_ib.cc:601 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO include/net.h:24 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO transport/net.cc:388 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO init.cc:679 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO init.cc:815 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO init.cc:951 -> 2
(pid=17949, ip=10.221.0.5) ray-default-worker-5ee698a00:17949:18689 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
(pid=17937, ip=10.221.0.5) ray-default-worker-5ee698a00:17937:18695 [0] NCCL INFO Ring 01 : 5[5] -> 8[4] via P2P/IPC
(pid=17944, ip=10.221.0.5) ray-default-worker-5ee698a00:17944:18693 [0] NCCL INFO Ring 01 : 8[4] -> 12[3] via P2P/IPC
(pid=17936, ip=10.221.0.5) ray-default-worker-5ee698a00:17936:18699 [0] NCCL INFO Ring 01 : 15[0] -> 14[1] via P2P/IPC
(pid=17936, ip=10.221.0.5) ray-default-worker-5ee698a00:17936:18699 [0] NCCL INFO Trees [0] 14->15->-1/-1/-1 [1] 14->15->-1/-1/-1
(pid=17939, ip=10.221.0.5) ray-default-worker-5ee698a00:17939:18697 [0] NCCL INFO Ring 01 : 12[3] -> 13[2] via P2P/IPC
(pid=17938, ip=10.221.0.5) ray-default-worker-5ee698a00:17938:18688 [0] NCCL INFO Ring 01 : 13[2] -> 14[1] via P2P/IPC
(pid=17934, ip=10.221.0.5) ray-default-worker-5ee698a00:17934:18687 [0] NCCL INFO Ring 01 : 14[1] -> 15[0] via P2P/IPC
(pid=7999) ray-default-head-8f920c480:7999:8901 [0] NCCL INFO Ring 01 : 11[0] -> 10[1] via P2P/IPC
(pid=7999) ray-default-head-8f920c480:7999:8901 [0] NCCL INFO Trees [0] 10->11->-1/-1/-1 [1] 10->11->-1/-1/-1
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO Ring 01 : 0[7] -> 1[6] via P2P/IPC
(pid=8000) 
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] misc/ibvwrap.cc:252 NCCL WARN Call to ibv_reg_mr failed
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO transport/net_ib.cc:601 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO include/net.h:24 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO transport/net.cc:388 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO init.cc:679 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO init.cc:815 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO init.cc:951 -> 2
(pid=8000) ray-default-head-8f920c480:8000:8891 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
(pid=8007) ray-default-head-8f920c480:8007:8903 [0] NCCL INFO Ring 01 : 4[5] -> 6[4] via P2P/IPC
(pid=7991) ray-default-head-8f920c480:7991:8899 [0] NCCL INFO Ring 01 : 6[4] -> 7[3] via P2P/IPC
(pid=7951) ray-default-head-8f920c480:7951:8893 [0] NCCL INFO Ring 01 : 7[3] -> 9[2] via P2P/IPC
(pid=7927) ray-default-head-8f920c480:7927:8905 [0] NCCL INFO Ring 01 : 10[1] -> 9[2] via P2P/IPC
(pid=8005) ray-default-head-8f920c480:8005:8895 [0] NCCL INFO Ring 01 : 1[6] -> 4[5] via P2P/IPC
(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO Ring 01 : 9[2] -> 10[1] via P2P/IPC
(pid=17940, ip=10.221.0.5) ray-default-worker-5ee698a00:17940:18696 [0] NCCL INFO Ring 01 : 3[6] -> 2[7] via P2P/IPC
(pid=17937, ip=10.221.0.5) ray-default-worker-5ee698a00:17937:18695 [0] NCCL INFO Ring 01 : 5[5] -> 3[6] via P2P/IPC
(pid=17937, ip=10.221.0.5) ray-default-worker-5ee698a00:17937:18695 [0] NCCL INFO Trees [0] 3->5->8/-1/-1 [1] 3->5->8/-1/-1
(pid=17944, ip=10.221.0.5) ray-default-worker-5ee698a00:17944:18693 [0] NCCL INFO Ring 01 : 8[4] -> 5[5] via P2P/IPC
(pid=17944, ip=10.221.0.5) ray-default-worker-5ee698a00:17944:18693 [0] NCCL INFO Trees [0] 5->8->12/-1/-1 [1] 5->8->12/-1/-1
(pid=17944, ip=10.221.0.5) ray-default-worker-5ee698a00:17944:18693 [0] NCCL INFO comm 0x7edaac0028a0 rank 8 nranks 16 cudaDev 0 nvmlDev 4 - Init COMPLETE
(pid=17936, ip=10.221.0.5) ray-default-worker-5ee698a00:17936:18699 [0] NCCL INFO comm 0x7f90f40028a0 rank 15 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
(pid=17939, ip=10.221.0.5) ray-default-worker-5ee698a00:17939:18697 [0] NCCL INFO Ring 01 : 12[3] -> 8[4] via P2P/IPC
(pid=17939, ip=10.221.0.5) ray-default-worker-5ee698a00:17939:18697 [0] NCCL INFO Trees [0] 8->12->13/-1/-1 [1] 8->12->13/-1/-1
(pid=17939, ip=10.221.0.5) ray-default-worker-5ee698a00:17939:18697 [0] NCCL INFO comm 0x7f6c240028a0 rank 12 nranks 16 cudaDev 0 nvmlDev 3 - Init COMPLETE
(pid=17938, ip=10.221.0.5) ray-default-worker-5ee698a00:17938:18688 [0] NCCL INFO Ring 01 : 13[2] -> 12[3] via P2P/IPC
(pid=17938, ip=10.221.0.5) ray-default-worker-5ee698a00:17938:18688 [0] NCCL INFO Trees [0] 12->13->14/-1/-1 [1] 12->13->14/-1/-1
(pid=17938, ip=10.221.0.5) ray-default-worker-5ee698a00:17938:18688 [0] NCCL INFO comm 0x7f7d080028a0 rank 13 nranks 16 cudaDev 0 nvmlDev 2 - Init COMPLETE
(pid=17934, ip=10.221.0.5) ray-default-worker-5ee698a00:17934:18687 [0] NCCL INFO Ring 01 : 14[1] -> 13[2] via P2P/IPC
(pid=17934, ip=10.221.0.5) ray-default-worker-5ee698a00:17934:18687 [0] NCCL INFO Trees [0] 13->14->15/-1/-1 [1] 13->14->15/-1/-1
(pid=17934, ip=10.221.0.5) ray-default-worker-5ee698a00:17934:18687 [0] NCCL INFO comm 0x7f8f8c0028a0 rank 14 nranks 16 cudaDev 0 nvmlDev 1 - Init COMPLETE
(pid=17937, ip=10.221.0.5) ray-default-worker-5ee698a00:17937:18695 [0] NCCL INFO comm 0x7ed1700028a0 rank 5 nranks 16 cudaDev 0 nvmlDev 5 - Init COMPLETE
(pid=7999) ray-default-head-8f920c480:7999:8901 [0] NCCL INFO comm 0x7f7e440028a0 rank 11 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
(pid=8007) ray-default-head-8f920c480:8007:8903 [0] NCCL INFO Ring 01 : 4[5] -> 1[6] via P2P/IPC
(pid=8007) ray-default-head-8f920c480:8007:8903 [0] NCCL INFO Trees [0] 1->4->6/-1/-1 [1] 1->4->6/-1/-1
(pid=8007) ray-default-head-8f920c480:8007:8903 [0] NCCL INFO comm 0x7f247c0028a0 rank 4 nranks 16 cudaDev 0 nvmlDev 5 - Init COMPLETE
(pid=7991) ray-default-head-8f920c480:7991:8899 [0] NCCL INFO Ring 01 : 6[4] -> 4[5] via P2P/IPC
(pid=7991) ray-default-head-8f920c480:7991:8899 [0] NCCL INFO Trees [0] 4->6->7/-1/-1 [1] 4->6->7/-1/-1
(pid=7991) ray-default-head-8f920c480:7991:8899 [0] NCCL INFO comm 0x7efed00028a0 rank 6 nranks 16 cudaDev 0 nvmlDev 4 - Init COMPLETE
(pid=7951) ray-default-head-8f920c480:7951:8893 [0] NCCL INFO Ring 01 : 7[3] -> 6[4] via P2P/IPC
(pid=7951) ray-default-head-8f920c480:7951:8893 [0] NCCL INFO Trees [0] 6->7->9/-1/-1 [1] 6->7->9/-1/-1
(pid=7951) ray-default-head-8f920c480:7951:8893 [0] NCCL INFO comm 0x7ecd5c0028a0 rank 7 nranks 16 cudaDev 0 nvmlDev 3 - Init COMPLETE
(pid=7927) ray-default-head-8f920c480:7927:8905 [0] NCCL INFO Trees [0] 9->10->11/-1/-1 [1] 9->10->11/-1/-1
(pid=7927) ray-default-head-8f920c480:7927:8905 [0] NCCL INFO comm 0x7ed0fc0028a0 rank 10 nranks 16 cudaDev 0 nvmlDev 1 - Init COMPLETE
(pid=8005) ray-default-head-8f920c480:8005:8895 [0] NCCL INFO Ring 01 : 1[6] -> 0[7] via P2P/IPC
(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO Ring 01 : 9[2] -> 7[3] via P2P/IPC
(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO Trees [0] 7->9->10/-1/-1 [1] 7->9->10/-1/-1
(pid=7903) ray-default-head-8f920c480:7903:8897 [0] NCCL INFO comm 0x7f60b40028a0 rank 9 nranks 16 cudaDev 0 nvmlDev 2 - Init COMPLETE
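The repeated `ibv_reg_mr failed` warnings above indicate that NCCL's InfiniBand transport cannot register (pin) memory on these nodes. A common workaround in this situation — a sketch, not a confirmed fix for this cluster — is to disable the InfiniBand transport and fall back to TCP sockets via standard NCCL environment variables, set before the first NCCL communicator is created (i.e., before the trainer starts its workers). The interface name `eth0` below is an assumption and must match the actual NIC on the VM:

```python
import os

# Hedged workaround sketch: steer NCCL away from the failing
# InfiniBand path. These are documented NCCL environment variables;
# whether they resolve this particular Azure failure is unverified.
os.environ["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging, as seen in the trace above
os.environ["NCCL_IB_DISABLE"] = "1"        # skip InfiniBand, use TCP sockets instead
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # assumed NIC name; adjust for the VM image

print(os.environ["NCCL_IB_DISABLE"])  # -> 1
```

Note that with Ray these variables must be visible to the worker processes on every node, not just the driver, so they typically need to be set cluster-wide (e.g., in the cluster config's setup commands) rather than only in the launching script.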

Cluster information:

vmSize: Standard_ND40rs_v2
        # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
        imagePublisher: microsoft-dsvm
        imageOffer: ubuntu-1804
        imageSku: "1804-gen2"
        imageVersion: 20.07.06

python dcgan.py --use-gpu --num-workers 16 --address=auto

Looks like it’s an NCCL error with that specific Azure instance type.
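One plausible (unconfirmed) cause of `ibv_reg_mr` failures is a locked-memory (`memlock`) ulimit that is too small for InfiniBand to pin its buffers. A quick check to run on each node:

```shell
# Print the locked-memory limit (in KB, or "unlimited"). InfiniBand
# memory registration (ibv_reg_mr) needs pinned pages, and a small
# memlock limit is a common cause of this exact failure.
ulimit -l || echo "memlock limit unavailable in this shell"

# If the value is small (e.g., 64), raising it via
# /etc/security/limits.conf is a typical remedy:
#   * soft memlock unlimited
#   * hard memlock unlimited
```

Since Ray workers are long-lived daemon processes, the raised limit has to apply to the session that starts the Ray runtime on each node, not just an interactive login shell.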

Ray version and other system information (Python version, TensorFlow version, OS): Ray 1.0, Python 3.6 (per the traceback paths).

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
andrijazz commented, Oct 28, 2020

PyTorch version 1.16; installed NCCL packages:

libnccl-dev/unknown,now 2.2.13-1+cuda9.2 amd64 [installed]
  NVIDIA Collectives Communication Library (NCCL) Development Files

libnccl2/unknown,now 2.2.13-1+cuda9.2 amd64 [installed]
  NVIDIA Collectives Communication Library (NCCL) Runtime
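Note the version mismatch: the apt listing above shows NCCL 2.2.13, while the traceback reports "NCCL version 2.4.8". PyTorch wheels bundle their own NCCL, so the system packages may not be what actually runs. A small guarded sketch to report the NCCL version PyTorch links against (the `torch.cuda.nccl.version()` call exists in standard PyTorch builds; the fallback handles environments without torch or without NCCL support):

```python
def torch_nccl_version():
    """Return the NCCL version bundled with PyTorch, or None if unavailable."""
    try:
        import torch
        # Returns an int on older torch releases, a tuple on newer ones.
        return torch.cuda.nccl.version()
    except (ImportError, AttributeError, RuntimeError):
        # torch missing, or built without NCCL (e.g., CPU-only builds)
        return None

print(torch_nccl_version())
```

Comparing this value against the cluster's system `libnccl2` package helps rule out confusion about which library is actually producing the error.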
0 reactions
amogkam commented, Apr 11, 2022

@andrijazz, is this still a problem with Ray Train? We’ve made various improvements here to avoid NCCL issues, particularly with the setting of CUDA_VISIBLE_DEVICES for Ray actors.

I am going to close this for now, but please re-open if there are still issues with the latest Ray Train.
