The program stops at line `dist.init()`
See original GitHub issue

After I run

```
CUDA_VISIBLE_DEVICES=1 torchpack dist-run -np 1 python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml 2>&1 | tee ./train.log
```

the program stops at the line `dist.init()` in train.py and cannot continue to run.
Is there something wrong? Could you please help me solve this problem?
Environment: cudatoolkit 10.2, pytorch 1.8.0, python 3.6, openmpi 4.1.1
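For context on why `dist.init()` can hang: assuming torchpack's `dist.init()` ultimately performs the same kind of TCP rendezvous as `torch.distributed`'s env:// initialization, a hang at that line usually means the launched processes never agree on a master address/port. Below is a standard-library-only model of that rendezvous; all names (`run_master`, `run_worker`) are illustrative, not torchpack's real API.

```python
# Illustrative model (standard library only) of the TCP rendezvous that
# torch.distributed's env:// initialization performs: every rank must
# agree on MASTER_ADDR and MASTER_PORT, otherwise the non-master ranks
# block at startup -- the "stuck at dist.init()" symptom.
import socket
import threading

MASTER_ADDR = "127.0.0.1"
MASTER_PORT = 29500  # torch.distributed's conventional default port

def run_master(port: int, ready: threading.Event, world_size: int) -> None:
    # Rank 0 listens and waits for every other rank to check in.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((MASTER_ADDR, port))
        srv.listen(world_size - 1)
        ready.set()  # signal that the master is accepting connections
        for _ in range(world_size - 1):
            conn, _ = srv.accept()
            conn.close()

def run_worker(port: int, timeout: float = 2.0) -> bool:
    # A non-zero rank connects to the master. With the wrong port the
    # connection never succeeds, so initialization appears to hang.
    try:
        socket.create_connection((MASTER_ADDR, port), timeout=timeout).close()
        return True
    except OSError:
        return False

if __name__ == "__main__":
    ready = threading.Event()
    t = threading.Thread(target=run_master, args=(MASTER_PORT, ready, 2), daemon=True)
    t.start()
    ready.wait()
    print(run_worker(MASTER_PORT))      # matching port: rendezvous completes
    t.join()
```

If this minimal rendezvous also stalls on your machine (e.g. the port is firewalled or already taken), the problem is in the environment rather than in train.py.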
Issue Analytics
- State:
- Created 2 years ago
- Comments: 8 (3 by maintainers)
Hi @zhijian-liu ,
Thanks for your reply!
I have changed the command to

```
torchpack dist-run -np 3 python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml 2>&1 | tee ./train.log
```

The program still stops at the line `dist.init()` in train.py and cannot continue to run, with only `Failed to import tensorflow` printed on the screen. What is wrong with it?
I finally chose not to use MPI. The MPI-related constant parameters are set to context.py's default values.
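A minimal sketch of that kind of fallback, assuming the launcher is Open MPI (which exports `OMPI_COMM_WORLD_*` environment variables to each rank); the helper name and the idea of reading these variables directly are illustrative, not torchpack's actual context.py logic:

```python
# Hypothetical sketch: detect whether the script was launched by an
# Open MPI launcher (which exports OMPI_COMM_WORLD_* variables) and
# otherwise default to a single process, mirroring the default values
# that context.py would supply when MPI is not used.
import os

def mpi_rank_and_size(env=None):
    """Return (rank, world_size); (0, 1) when not launched via Open MPI."""
    env = os.environ if env is None else env
    rank = int(env.get("OMPI_COMM_WORLD_RANK", 0))
    size = int(env.get("OMPI_COMM_WORLD_SIZE", 1))
    return rank, size

if __name__ == "__main__":
    rank, world_size = mpi_rank_and_size()
    if world_size == 1:
        print("single-process fallback")  # no MPI launcher detected
```

With this shape, running `python train.py ...` directly (without `torchpack dist-run`) naturally falls back to single-process mode.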