Got stuck when training with multiple GPUs using dist_train.sh
All child threads get stuck when training with multiple GPUs using dist_train.sh, with CUDA 11.3 and PyTorch 1.10. After diagnosis, I found it was stuck at https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/utils/common_utils.py#L166-L171
I modified the code from
    dist.init_process_group(
        backend=backend,
        init_method='tcp://127.0.0.1:%d' % tcp_port,
        rank=local_rank,
        world_size=num_gpus
    )
to
    dist.init_process_group(
        backend=backend
    )
and it worked.
I’m curious why this works. If anyone else is having the same problem, you can try the same change.
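A note on what the shorter call does, as a sketch rather than a confirmed diagnosis: when `init_process_group` is given neither `init_method` nor `store`, PyTorch falls back to `env://` and reads `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE` from the environment, which a launcher such as `torch.distributed.launch` is expected to export. Whether the explicit `tcp://` arguments disagreed with those values in this setup is not established in the thread. The helper names below (`missing_rendezvous_vars`, `init_distributed`) are hypothetical and not part of OpenPCDet; they only check that the launcher set the variables before initializing.

```python
import os

# Variables read by the env:// rendezvous; a launcher such as
# torch.distributed.launch or torchrun is expected to export them.
RENDEZVOUS_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def missing_rendezvous_vars(env=None):
    """Return the names of rendezvous variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in RENDEZVOUS_VARS if not env.get(name)]


def init_distributed(backend="nccl"):
    """Hypothetical helper: initialize the process group from the environment."""
    missing = missing_rendezvous_vars()
    if missing:
        # Fail fast with a clear message instead of blocking in the rendezvous.
        raise RuntimeError("launcher did not set: " + ", ".join(missing))

    # Imported lazily so the environment check above works without PyTorch.
    import torch.distributed as dist

    # No init_method or store is passed, so PyTorch uses "env://".
    dist.init_process_group(backend=backend)
    return dist.get_rank(), dist.get_world_size()
```

Calling `missing_rendezvous_vars()` at the top of the training script, before any call into `torch.distributed`, shows quickly whether the process was started by a launcher or run directly.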

@sshaoshuai Thanks for your work. It’s ok now.
So what causes this hang? I also encountered it and will try your way…