Training stuck before first epoch with `ddp` and multi-gpu
🐛 Bug
Training is stuck when using `ddp`, `gpus=[0, 1]`, and `num_sanity_val_steps=2`. The two validation sanity checks are executed. The code execution seems to be stuck at `self.scaler.step(optimizer)` in `pre_optimizer_step` in `pytorch_lightning/plugins/precision/native_amp.py`, and more specifically inside PyTorch at
https://github.com/pytorch/pytorch/blob/4f8b986e28736b59bc46cd0873a0f36fdaa6f5b8/torch/cuda/amp/grad_scaler.py#L284
If I instead use `dp`, or `gpus=[0]`, or `num_sanity_val_steps=0`, the training runs normally (any one of these changes on its own is enough to make the code work).
Also, the code works with:
- torch==1.8.1+cu111, pytorch-lightning==1.3.8
- torch==1.10.2+cu113, pytorch-lightning==1.3.8

The code does not work with:
- torch==1.10.2+cu113, pytorch-lightning==1.4.0
- torch==1.10.2+cu113, pytorch-lightning==1.5.10
To Reproduce
Annoyingly, I cannot reproduce the issue with the BoringModel.
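For reference, a rough sketch of the Trainer configuration that triggers the hang. The model and datamodule below are placeholders for my real code (which I could not reduce to a BoringModel repro), and `precision=16` is implied by the fact that the hang happens inside the native AMP plugin:

```python
import pytorch_lightning as pl

# Placeholders for my real model/datamodule (hypothetical imports).
from my_project import MyModel, MyDataModule

model = MyModel()
datamodule = MyDataModule()

trainer = pl.Trainer(
    gpus=[0, 1],             # two GPUs -> hangs; gpus=[0] works
    strategy="ddp",          # hangs with ddp; dp works
    precision=16,            # native AMP, so GradScaler.step() is on the code path
    num_sanity_val_steps=2,  # hangs; num_sanity_val_steps=0 works
)

# The two sanity-check validation batches run, then execution gets stuck
# inside GradScaler.step() before the first training step.
trainer.fit(model, datamodule=datamodule)
```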
Environment
- PyTorch Lightning Version: 1.5.10
- PyTorch Version: 1.10.2+cu113
- Python version: 3.7
- OS: Ubuntu 18.04
- CUDA/cuDNN version: 11.6
- GPU models and configuration: 2*2080Ti
- How you installed PyTorch: pip
- Any other relevant information: code works with pytorch-lightning versions before 1.4.0
cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7 @carmocca
Just as a clarification: for me, the training is stuck before the first training step is executed, so after the validation checks and before the second batch.
I had the same issue. I replaced the DDP sampler myself and set `drop_last=True` to make sure each node gets the same number of batches, but it still got stuck at the end. The funny thing is that if `limit_train_batches` is set to an int, it works fine.
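A rough sketch of that workaround, under the assumption that "replacing the DDP sampler myself" means building a `DistributedSampler` manually and passing `replace_sampler_ddp=False` (the datamodule and dataset here are placeholders, not code from the original comment):

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

class MyDataModule(pl.LightningDataModule):  # placeholder datamodule
    def __init__(self, train_dataset, batch_size=32):
        super().__init__()
        self.train_dataset = train_dataset
        self.batch_size = batch_size

    def train_dataloader(self):
        # Manual DistributedSampler with drop_last=True so that every
        # rank sees the same number of batches.
        sampler = DistributedSampler(self.train_dataset, shuffle=True, drop_last=True)
        return DataLoader(self.train_dataset, batch_size=self.batch_size, sampler=sampler)

trainer = pl.Trainer(
    gpus=[0, 1],
    strategy="ddp",
    replace_sampler_ddp=False,  # keep Lightning from injecting its own sampler
    limit_train_batches=100,    # per the comment, an integer value avoids the hang
)
```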