
Training stuck before first epoch with `ddp` and multi-gpu

See original GitHub issue

🐛 Bug

Training is stuck when using ddp, gpus=[0, 1], and num_sanity_val_steps=2. The two validation sanity-check steps are executed, but execution then seems to hang at self.scaler.step(optimizer) in pre_optimizer_step in pytorch_lightning/plugins/precision/native_amp.py, and more specifically inside PyTorch at https://github.com/pytorch/pytorch/blob/4f8b986e28736b59bc46cd0873a0f36fdaa6f5b8/torch/cuda/amp/grad_scaler.py#L284

If I instead use dp, or gpus=[0], or num_sanity_val_steps=0, training runs normally (any one of these changes on its own is enough to make the code work).

The code works with:

  • torch==1.8.1+cu111, pytorch-lightning==1.3.8
  • torch==1.10.2+cu113, pytorch-lightning==1.3.8

The code does not work with:

  • torch==1.10.2+cu113, pytorch-lightning==1.4.0
  • torch==1.10.2+cu113, pytorch-lightning==1.5.10

To Reproduce

Annoyingly, I cannot reproduce the issue with the BoringModel.
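For reference, a minimal sketch of the kind of setup described in the report is shown below. The ToyModel and random dataset are hypothetical stand-ins for the real (unshared) training code, and precision=16 is assumed from the native-AMP scaler.step trace; as noted above, a minimal example like this is not known to reproduce the hang on its own.

```python
# Minimal sketch of the configuration described in the report (PL 1.5.x-style API;
# on 1.4.x the flag was accelerator="ddp"). ToyModel and the random dataset are
# placeholders and this sketch does not reproduce the hang by itself.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.cross_entropy(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def make_loader():
    data = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    return DataLoader(data, batch_size=16, num_workers=2)


if __name__ == "__main__":
    trainer = pl.Trainer(
        strategy="ddp",           # hangs; "dp" works
        gpus=[0, 1],              # hangs; gpus=[0] works
        num_sanity_val_steps=2,   # hangs; 0 works
        precision=16,             # native AMP (assumed from the scaler.step trace)
        max_epochs=1,
    )
    trainer.fit(ToyModel(), make_loader(), make_loader())
```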

Environment

  • PyTorch Lightning Version: 1.5.10
  • PyTorch Version: 1.10.2+cu113
  • Python version: 3.7
  • OS: Ubuntu 18.04
  • CUDA/cuDNN version: 11.6
  • GPU models and configuration: 2× 2080 Ti
  • How you installed PyTorch: pip
  • Any other relevant information: the code works before pytorch-lightning 1.4.0

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7 @carmocca

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 6
  • Comments: 15 (6 by maintainers)

Top GitHub Comments

4 reactions
AljoSt commented, Feb 23, 2022

Just as a clarification: for me, training is stuck before the first training step is executed, i.e. after the validation checks and before the second batch.

2 reactions
qmpzzpmq commented, Mar 1, 2022

I had the same issue. I replaced the DDP sampler myself and set drop_last=True to make sure each node gets the same number of batches, but it still got stuck at the end. The funny thing is that if limit_train_batches is set to an int, it works fine.
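The exact code is not shown in that comment, but a rough sketch of the workaround it describes, assuming a hand-built DistributedSampler with drop_last=True and Lightning's sampler injection turned off via replace_sampler_ddp=False, might look like this (MyModel and the dataset objects are placeholders):

```python
# Rough sketch of the workaround described above: build the DistributedSampler
# manually with drop_last=True so every rank sees the same number of batches,
# and tell Lightning not to inject its own sampler.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import pytorch_lightning as pl


class MyDataModule(pl.LightningDataModule):
    def __init__(self, dataset, batch_size=16):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size

    def train_dataloader(self):
        # Built inside the DDP worker processes, so the sampler can read rank and
        # world size from the initialized process group. drop_last=True drops the
        # ragged final batch so all ranks stay in step.
        sampler = DistributedSampler(self.dataset, shuffle=True, drop_last=True)
        return DataLoader(self.dataset, batch_size=self.batch_size, sampler=sampler)


trainer = pl.Trainer(
    strategy="ddp",
    gpus=[0, 1],
    replace_sampler_ddp=False,   # keep the hand-built sampler
    limit_train_batches=100,     # the comment reports an int here avoids the hang
    max_epochs=1,
)
# trainer.fit(MyModel(), datamodule=MyDataModule(my_dataset))
```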


Top Results From Across the Web

Training stuck before first epoch with ddp and multi-gpu #11910
Bug Training is stuck when using ddp, gpus=[0, 1], and num_sanity_val_steps=2. The two validation checks are executed.

Stucks on 8gpu training setting - distributed - PyTorch Forums
I trained them on 1, 4, 5, 8 gpu environment using DDP. However, all of 8gpu and 5gpu training attempts, are stuck and...

Stucks on 8gpu training setting - Lightning AI
I am using Pytorch Lightning Framework to train the Text to Text Transformer model (google/mt5-base at main). I trained them on 1, 4,...

Stuck In First Epoch When Training CNN model in google Colab
Check if TensorFlow is using a GPU or not. You can try reducing batch size.

Ray Trainer prepare_model gets stuck
prepare_model(student)" and on the print with "Wrapping provided model in DDP.". I've let the script run for several minutes without any visible ...
