
Multi-GPU training hangs

See original GitHub issue

Command run: bash ./tools/dist_train.sh configs/carafe/mask_rcnn_r50_fpn_carafe_1x_coco.py 8

There is no error, but training hangs after the output 2022-05-12 22:06:13,674 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration. GPU utilization stays around ~90% (continuing to fluctuate) until the process is killed.

PyTorch version: 1.7.1
CUDA version: 11.6

I haven’t been able to track down this exact problem in previous issues, but it seems that training getting stuck in general is a known issue?
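Since the hang occurs right after DDP rebuilds its reducer buckets, a common first step is to turn on distributed-debug logging before launching, to see which collective each rank is blocked in. A minimal sketch, assuming an NCCL backend; note that TORCH_DISTRIBUTED_DEBUG requires PyTorch >= 1.9, newer than the 1.7.1 reported in this issue:

```shell
# Verbose NCCL logs: each rank reports the collective it is executing,
# which makes a rank mismatch or a stuck collective visible.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL

# Extra DDP-side checks (PyTorch >= 1.9 only): flags mismatched
# collective calls and unused parameters across ranks.
export TORCH_DISTRIBUTED_DEBUG=DETAIL

# Common workaround when peer-to-peer transport hangs on some
# driver/hardware combinations: force NCCL off P2P transports.
# export NCCL_P2P_DISABLE=1

bash ./tools/dist_train.sh configs/carafe/mask_rcnn_r50_fpn_carafe_1x_coco.py 8
```

If the logs show different ranks blocked in different collectives, the usual cause is rank-dependent control flow (e.g. a branch or an early `continue` that only some GPUs take), so each rank waits on a collective its peers never enter.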

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7

Top GitHub Comments

1 reaction
jayphone17 commented, Jul 5, 2022

Same issue while using tools/dist_train.sh configs/yolox/yolox_s_8x8_300e_coco.py 8 for multi-GPU training: no errors, but it stops outputting logs after 2022-07-05 13:49:18,605 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration. Have you solved it?

0 reactions
yuhua666 commented, Dec 20, 2022

Same issue.

Read more comments on GitHub >

Top Results From Across the Web

Multi GPU training is stuck · Issue #9242 - GitHub
I am experiencing the same problem. ... The output hangs after just one step of training_step (one batch for each GPU)....
Read more >
Single node, multi GPU DistributedDataParallel training in ...
There is no error, the process just hangs. The images all seem to be successfully loaded into the instance memory, as a good...
Read more >
Multi-gpu training hangs due to an `if` - PyTorch Forums
Hi, I discovered recently my 8-GPU training will hang if I have this if (using DDP, all GPUs saturate at 100%, happens randomly...
Read more >
DDP strategy. Training hangs upon distributed GPU initialisation
Hello Everyone, Initially, I trained my model in single GPU environment. And it was working perfectly fine. But now I have increased GPU's...
Read more >
Mutli GPU freezes on Roberta Pretraining - Beginners
I'm getting annoying crashes when I try to train a roberta model with two Titan X GPUs. I see in the documentation that...
Read more >
