
Got stuck when training with multiple GPUs using dist_train.sh

See original GitHub issue

All child threads get stuck when training with multiple GPUs using dist_train.sh, with CUDA == 11.3 and PyTorch == 1.10. After some diagnosis, I found the training hangs at https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/utils/common_utils.py#L166-L171

I modified the code from

dist.init_process_group(
        backend=backend,
        init_method='tcp://127.0.0.1:%d' % tcp_port,
        rank=local_rank,
        world_size=num_gpus
)

to

dist.init_process_group(
        backend=backend
)

and it worked.

I’m curious why this is the case; if anyone else is hitting the same problem, you can try the same change.
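
For context: when init_process_group is called with only a backend, PyTorch falls back to the default env:// rendezvous and reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment variables that torch.distributed.launch / torchrun export, instead of using the hard-coded tcp://127.0.0.1:<port> address. Since dist_train.sh goes through such a launcher, those variables are presumably already set, which would explain why dropping the explicit init_method avoids the hang. A minimal sketch of that default path (the helper name and the LOCAL_RANK handling are illustrative, not OpenPCDet's actual code):

import os

import torch
import torch.distributed as dist


def init_dist_env(backend='nccl'):
    # No init_method given, so init_process_group defaults to 'env://' and
    # picks up MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the
    # environment set up by torch.distributed.launch / torchrun.
    dist.init_process_group(backend=backend)

    # torchrun (and launch with --use_env) also exports LOCAL_RANK, which each
    # process uses to pick its GPU on the local machine.
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)
    return dist.get_rank(), dist.get_world_size(), local_rank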

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 9
  • Comments: 9 (1 by maintainers)

Top GitHub Comments

1 reaction
Eaphan commented on Feb 3, 2022

Thank you for the bug report. It has been fixed in #784.

Could you help double-check whether it works now?

@sshaoshuai Thanks for your work. It’s ok now.

0 reactions
Liaoqing-up commented on May 14, 2022

@sshaoshuai With torch==1.9.0 and cuda==11.1, I got stuck at dist.init_process_group even though the code is the latest… Other distributed training projects with the same init_process_group call ran successfully. …

After I uncommented the lines mentioned in #784 (comment), it works.

So what is the cause of this hang? I also encountered it and will try your approach…
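
Not a confirmed diagnosis of this particular hang, but one cheap thing to rule out before relaunching is a stale process still holding the hard-coded rendezvous port, which makes the tcp:// init wait forever. A small check along these lines (the port number is only an example; use the tcp_port you pass to dist_train.sh):

import socket


def tcp_port_is_free(port, host='127.0.0.1'):
    # connect_ex returns 0 only if something is already listening on the port,
    # so a non-zero result means the port is free for the rendezvous.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0


if __name__ == '__main__':
    print('port 18888 free:', tcp_port_is_free(18888))

If the port is fine, launching with NCCL_DEBUG=INFO in the environment can also show where the NCCL rendezvous stalls.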
