The program stops at line `dist.init()`

@zhijian-liu

After I run `CUDA_VISIBLE_DEVICES=1 torchpack dist-run -np 1 python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml 2>&1 | tee ./train.log`, the program stops at line `dist.init()` in train.py and cannot continue to run.

Is there something wrong? Could you please help me solve this problem?

Environment: cudatoolkit 10.2, pytorch 1.8.0, python 3.6, openmpi 4.1.1

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

mtli77 commented, Jun 2, 2021

The environment seems correct. Are you going to train the model with only 1 GPU?

Hi @zhijian-liu ,

Thanks for your reply!

I have changed the command to `torchpack dist-run -np 3 python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml 2>&1 | tee ./train.log`

The program still stops at line `dist.init()` in train.py and cannot continue to run. The only output printed on the screen is `Failed to import tensorflow`.

What is wrong with it?
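
When a call hangs without printing anything, a stack dump can show where it is blocked. The following is a minimal sketch using Python's standard `faulthandler` module. The helper name `dump_stack_if_stuck` and the timeout are hypothetical and not part of torchpack or train.py.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stack_if_stuck(timeout_s, file=sys.stderr):
    """Write every thread's traceback to `file` each time the body
    has been running for another `timeout_s` seconds.

    `file` must be backed by a real file descriptor (for example
    sys.stderr or an opened file), because faulthandler writes to
    the descriptor directly.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # Disarm the watchdog once the body returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

In train.py this could wrap the suspect call, for example `with dump_stack_if_stuck(60): dist.init()`. If the call is stuck, the traceback printed after the timeout should point at the line inside the library where it is waiting.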

LeopoldACC commented, Dec 1, 2021

In the end I chose not to use MPI. The MPI-related constants are left at the default values from context.py.
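
The idea behind this workaround can be sketched as follows: use single-process defaults unless an MPI launcher has exported a process layout. This is an illustration only, not torchpack's actual context.py. The names `DistContext` and `context_from_env` and the default values are assumptions. The `OMPI_COMM_WORLD_*` variables are the ones Open MPI's launcher sets for each process.

```python
import os
from typing import Mapping, NamedTuple


class DistContext(NamedTuple):
    # Assumed single-process defaults (hypothetical, for illustration).
    world_size: int = 1
    world_rank: int = 0
    local_size: int = 1
    local_rank: int = 0


def context_from_env(environ: Mapping[str, str] = os.environ) -> DistContext:
    """Read the layout exported by Open MPI's launcher, or fall back
    to the single-process defaults when the variables are absent."""
    if "OMPI_COMM_WORLD_SIZE" not in environ:
        return DistContext()
    return DistContext(
        world_size=int(environ["OMPI_COMM_WORLD_SIZE"]),
        world_rank=int(environ["OMPI_COMM_WORLD_RANK"]),
        local_size=int(environ["OMPI_COMM_WORLD_LOCAL_SIZE"]),
        local_rank=int(environ["OMPI_COMM_WORLD_LOCAL_RANK"]),
    )
```

Because this only reads environment variables, nothing in it can block, which is the property the workaround relies on for single-GPU runs.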


Top Results From Across the Web

How to solve dist.init_process_group from hanging (or ...
We have been using the environment variable initialization method throughout this tutorial.
How to solve dist.init_process_group from hanging ... - GitHub
The issue is that the MASTER_PORT env variable needs to be the same for all processes in the group. As you have it...
Distributed communication package - torch.distributed - PyTorch
The distributed package comes with a distributed key-value store, which can be used to share information between processes in the group as well...
