Training stuck before first epoch with `ddp` and multi-gpu
🐛 Bug
Training is stuck when using `ddp`, `gpus=[0, 1]`, and `num_sanity_val_steps=2`. The two validation sanity checks are executed. The code execution seems to be stuck at `self.scaler.step(optimizer)` in `pre_optimizer_step` in `pytorch_lightning/plugins/precision/native_amp.py`, and more specifically inside PyTorch at
https://github.com/pytorch/pytorch/blob/4f8b986e28736b59bc46cd0873a0f36fdaa6f5b8/torch/cuda/amp/grad_scaler.py#L284
If I instead use `dp`, or `gpus=[0]`, or `num_sanity_val_steps=0`, the training runs normally (any one of these changes on its own is enough to make the code work).
Also, the code works with:
- torch==1.8.1+cu111, pytorch-lightning==1.3.8
- torch==1.10.2+cu113, pytorch-lightning==1.3.8

The code does not work with:
- torch==1.10.2+cu113, pytorch-lightning==1.4.0
- torch==1.10.2+cu113, pytorch-lightning==1.5.10
To Reproduce
Annoyingly, I cannot reproduce the issue with the BoringModel.
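For reference, a rough sketch of the Trainer configuration that triggers the hang. The model and datamodule below are placeholders for my real code (which I could not reduce to a BoringModel repro), and `precision=16` is implied by the fact that the hang happens inside the native AMP plugin:

```python
import pytorch_lightning as pl

# Placeholders for my real model/datamodule (hypothetical imports).
from my_project import MyModel, MyDataModule

model = MyModel()
datamodule = MyDataModule()

trainer = pl.Trainer(
    gpus=[0, 1],             # two GPUs -> hangs; gpus=[0] works
    strategy="ddp",          # hangs with ddp; dp works
    precision=16,            # native AMP, so GradScaler.step() is on the code path
    num_sanity_val_steps=2,  # hangs; num_sanity_val_steps=0 works
)

# The two sanity-check validation batches run, then execution gets stuck
# inside GradScaler.step() before the first training step.
trainer.fit(model, datamodule=datamodule)
```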
Environment
- PyTorch Lightning Version: 1.5.10
- PyTorch Version: 1.10.2+cu113
- Python version: 3.7
- OS: Ubuntu 18.04
- CUDA/cuDNN version: 11.6
- GPU models and configuration: 2*2080Ti
- How you installed PyTorch: pip
- Any other relevant information: code works with pytorch-lightning versions before 1.4.0
cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7 @carmocca
Just as a clarification: for me, the training is stuck before the first training step is executed, so after the validation checks and before the second batch.
I had the same issue. I replaced the DDP sampler myself and set `drop_last=True` to make sure each node gets the same number of batches, but it still got stuck at the end. The funny thing is that if `limit_train_batches` is set to an int, it works fine.
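A rough sketch of that workaround, under the assumption that "replacing the DDP sampler myself" means building a `DistributedSampler` manually and passing `replace_sampler_ddp=False` (the datamodule and dataset here are placeholders, not code from the original comment):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

class MyDataModule(pl.LightningDataModule):  # placeholder datamodule
    def __init__(self, train_dataset, batch_size=32):
        super().__init__()
        self.train_dataset = train_dataset
        self.batch_size = batch_size

    def train_dataloader(self):
        # Manual DistributedSampler with drop_last=True so that every
        # rank sees the same number of batches.
        sampler = DistributedSampler(self.train_dataset, shuffle=True, drop_last=True)
        return DataLoader(self.train_dataset, batch_size=self.batch_size, sampler=sampler)

trainer = pl.Trainer(
    gpus=[0, 1],
    strategy="ddp",
    replace_sampler_ddp=False,  # keep Lightning from injecting its own sampler
    limit_train_batches=100,    # per the comment, an integer value avoids the hang
)
```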