Multi-GPU training hangs
Command run:
bash ./tools/dist_train.sh configs/carafe/mask_rcnn_r50_fpn_carafe_1x_coco.py 8
There is no error, but training hangs after this last log line:
2022-05-12 22:06:13,674 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
GPU utilization stays at around 90% (and keeps fluctuating) until the job is killed.
PyTorch version: 1.7.1, CUDA version: 11.6
I haven't been able to find this exact problem in previous issues, but it seems that training getting stuck in general is a known issue?
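One way to narrow this down (a sketch of my own, not from the issue: the script name ddp_sanity_check.py and the launch line are assumptions) is to run a bare NCCL all-reduce across the same 8 GPUs, outside MMDetection. If this minimal script also hangs, the problem is in the environment (driver / NCCL / interconnect) rather than in the CARAFE config; if it passes, the hang is more likely inside the training loop itself (dataloader workers, a rank-dependent branch, etc.).

# ddp_sanity_check.py - minimal NCCL/DDP communication test, independent of MMDetection.
# Launch (PyTorch 1.7.1):
#   NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=8 ddp_sanity_check.py
import argparse
import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each worker process
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    # Reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the env set by the launcher
    dist.init_process_group(backend="nccl")

    # A single all-reduce: if raw GPU-to-GPU communication is healthy, every rank
    # prints the same sum and exits; if it hangs here, MMDetection is not the culprit.
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce OK, sum = {t.item()}")

if __name__ == "__main__":
    main()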
Same issue while using
tools/dist_train.sh configs/yolox/yolox_s_8x8_300e_coco.py 8
for multi-GPU training. No errors, but it stops outputting logs after:
2022-07-05 13:49:18,605 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
Have you solved it?

Same issue.
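Not from the thread, just a suggestion for anyone still stuck: you can make each hung worker dump its Python stack, which shows whether it is blocked in an NCCL collective or in the dataloader. faulthandler is in the standard library (Unix only for signal registration); putting it in tools/train.py is my assumption about where the entry point is.

# Add near the top of tools/train.py (or whatever entry script you launch) before training starts.
import faulthandler
import signal

# Each worker prints the stack of all its threads to stderr when it receives SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# While the job is hung, send the signal to every python worker, e.g.:
#   kill -USR1 <pid>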