
Got stuck when training with multiple GPUs using dist_train.sh

See original GitHub issue

All child threads get stuck when training with multiple GPUs using dist_train.sh, with CUDA == 11.3 and PyTorch == 1.10. After some diagnosis, I found the training hangs at https://github.com/open-mmlab/OpenPCDet/blob/master/pcdet/utils/common_utils.py#L166-L171

I modified the code from

dist.init_process_group(
        backend=backend,
        init_method='tcp://127.0.0.1:%d' % tcp_port,
        rank=local_rank,
        world_size=num_gpus
)

to

dist.init_process_group(
        backend=backend
)

and it worked.

I’m curious why this is the case; if anyone else is hitting the same problem, you can try the same change.
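
For context: when init_process_group is called with only a backend, PyTorch falls back to the default env:// rendezvous and reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the environment variables that torch.distributed.launch / torchrun export, instead of using the hard-coded tcp://127.0.0.1:<port> address. Since dist_train.sh goes through such a launcher, those variables are presumably already set, which would explain why dropping the explicit init_method avoids the hang. A minimal sketch of that default path (the helper name and the LOCAL_RANK handling are illustrative, not OpenPCDet's actual code):

import os

import torch
import torch.distributed as dist


def init_dist_env(backend='nccl'):
    # No init_method given, so init_process_group defaults to 'env://' and
    # picks up MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the
    # environment set up by torch.distributed.launch / torchrun.
    dist.init_process_group(backend=backend)

    # torchrun (and launch with --use_env) also exports LOCAL_RANK, which each
    # process uses to pick its GPU on the local machine.
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)
    return dist.get_rank(), dist.get_world_size(), local_rank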

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 9
  • Comments: 9 (1 by maintainers)

Top GitHub Comments

1 reaction
Eaphan commented on Feb 3, 2022

Thank you for the bug report. It has been fixed in #784.

Could you help double-check whether it works now?

@sshaoshuai Thanks for your work. It’s ok now.

0 reactions
Liaoqing-up commented on May 14, 2022

@sshaoshuai With torch==1.9.0 and cuda==11.1, I got stuck at dist.init_process_group even though the code is the latest… Other distributed training projects with the same init_process_group call ran successfully. …

After I uncommented the lines mentioned in #784 (comment), it works.

So what is the cause of this hang? I also encountered it and will try your approach…
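
Not a confirmed diagnosis of this particular hang, but one cheap thing to rule out before relaunching is a stale process still holding the hard-coded rendezvous port, which makes the tcp:// init wait forever. A small check along these lines (the port number is only an example; use the tcp_port you pass to dist_train.sh):

import socket


def tcp_port_is_free(port, host='127.0.0.1'):
    # connect_ex returns 0 only if something is already listening on the port,
    # so a non-zero result means the port is free for the rendezvous.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0


if __name__ == '__main__':
    print('port 18888 free:', tcp_port_is_free(18888))

If the port is fine, launching with NCCL_DEBUG=INFO in the environment can also show where the NCCL rendezvous stalls.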
