Multi-GPU training hangs
Command run:
bash ./tools/dist_train.sh configs/carafe/mask_rcnn_r50_fpn_carafe_1x_coco.py 8
There is no error, but training hangs after this last log line:
2022-05-12 22:06:13,674 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
GPU utilization stays at around 90% (and keeps fluctuating) until the job is killed.
PyTorch version: 1.7.1, CUDA version: 11.6
I haven't been able to find this exact problem in previous issues, but it seems that training getting stuck in general is a known issue?
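One way to narrow this down (a sketch of my own, not from the issue: the script name ddp_sanity_check.py and the launch line are assumptions) is to run a bare NCCL all-reduce across the same 8 GPUs, outside MMDetection. If this minimal script also hangs, the problem is in the environment (driver / NCCL / interconnect) rather than in the CARAFE config; if it passes, the hang is more likely inside the training loop itself (dataloader workers, a rank-dependent branch, etc.).

# ddp_sanity_check.py - minimal NCCL/DDP communication test, independent of MMDetection.
# Launch (PyTorch 1.7.1):
#   NCCL_DEBUG=INFO python -m torch.distributed.launch --nproc_per_node=8 ddp_sanity_check.py
import argparse
import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each worker process
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    # Reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the env set by the launcher
    dist.init_process_group(backend="nccl")

    # A single all-reduce: if raw GPU-to-GPU communication is healthy, every rank
    # prints the same sum and exits; if it hangs here, MMDetection is not the culprit.
    t = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce OK, sum = {t.item()}")

if __name__ == "__main__":
    main()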
Same issue while using
tools/dist_train.sh configs/yolox/yolox_s_8x8_300e_coco.py 8
for multi-GPU training. No errors, but it stops outputting logs after:
2022-07-05 13:49:18,605 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
Have you solved it?

Same issue.
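Not from the thread, just a suggestion for anyone still stuck: you can make each hung worker dump its Python stack, which shows whether it is blocked in an NCCL collective or in the dataloader. faulthandler is in the standard library (Unix only for signal registration); putting it in tools/train.py is my assumption about where the entry point is.

# Add near the top of tools/train.py (or whatever entry script you launch) before training starts.
import faulthandler
import signal

# Each worker prints the stack of all its threads to stderr when it receives SIGUSR1.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# While the job is hung, send the signal to every python worker, e.g.:
#   kill -USR1 <pid>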