DDP training randomly stopping
🐛 Bug
Edit: it randomly stops in the middle of a training epoch as well.
After validation ends (100%), the training process randomly stops without any error log. The stopping point changes from run to run (sometimes after epoch 4 validation, sometimes after epoch 1 validation), and every time it happens, one of the GPUs shows 0% utilization while the others stay at 100%. GPU memory remains allocated on all of them.
I have tried adding sync_dist=True in self.log and removed top_k model checkpointing, referencing https://github.com/PyTorchLightning/pytorch-lightning/issues/5865. Following https://github.com/PyTorchLightning/pytorch-lightning/issues/9851, I have also added seed_everything(). I checked that each GPU gets the same number of batches for both training and validation. However, the issue persists.
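For reference, the changes look roughly like this (a sketch: LitNewsRec, _shared_eval_step, and the seed value are placeholders standing in for my actual code):

import pytorch_lightning as pl

# Per #9851: seed everything before building the model / trainer.
pl.seed_everything(42)  # the real seed comes from my config

class LitNewsRec(pl.LightningModule):  # placeholder name for my model
    ...

    def validation_step(self, batch, batch_idx):
        loss, auc = self._shared_eval_step(batch)  # _shared_eval_step is a placeholder helper
        # Per #5865: sync_dist=True all-reduces the logged values across DDP ranks,
        # so every process logs the same metric.
        self.log("val_loss", loss, sync_dist=True)
        self.log("val_auc", auc, sync_dist=True)
        return loss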
Any solution to this problem?

To Reproduce
I was unable to reproduce the issue with the BoringModel, but since the stopping point is irregular even with the same seed passed to pl.seed_everything, I believe it is a bug in the DDP process itself.
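For completeness, the repro attempt followed the standard bug-report template, roughly along these lines (a sketch: the dataset sizes, seed, and gpus=2 are arbitrary and not from my real setup):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Synthetic data so the script is self-contained."""

    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("val_loss", loss, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    pl.seed_everything(42)
    trainer = pl.Trainer(
        accelerator="ddp",
        gpus=2,
        max_epochs=5,
        num_sanity_val_steps=0,
    )
    trainer.fit(
        BoringModel(),
        DataLoader(RandomDataset(32, 640), batch_size=2),
        DataLoader(RandomDataset(32, 640), batch_size=2),
    )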
Expected behavior
The training process should continue after validation.
Environment
- PyTorch Lightning Version (e.g., 1.5.0): 1.4.9
- PyTorch Version (e.g., 1.10): 1.9
- Python version (e.g., 3.9): 3.8
- OS (e.g., Linux): Linux
- CUDA/cuDNN version: 11.2
- GPU models and configuration: Google Cloud Platform A100 x8
- How you installed PyTorch (conda, pip, source): pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
- If compiling from source, the output of torch.__config__.show():
- Any other relevant information:
Additional context
Here is my code for the trainer:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath=log_dir,
    filename=cfg.exp_name + "-{epoch}-{val_auc:.3f}",
    every_n_epochs=1,
    save_top_k=-1,  # keep a checkpoint for every epoch
)

trainer = pl.Trainer(
    callbacks=[
        checkpoint_callback,
        LearningRateMonitor(logging_interval="step"),
    ],
    max_epochs=100,
    accelerator="ddp",
    gpus=str(cfg.gpus),
    logger=pl.loggers.WandbLogger(project="news_recommendation", name=cfg.exp_name),
    val_check_interval=cfg[cfg.experiment_type[cfg.current_stage]].val_check_interval,
    limit_train_batches=1.0,
    deterministic=True,
    num_sanity_val_steps=0,
    resume_from_checkpoint=cfg[cfg.experiment_type[cfg.current_stage]].load_ckpt,
)
cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7
I faced a similar issue, and it was not related to PyTorch Lightning; in my case it was a deadlock, as explained here: https://pytorch.org/docs/stable/notes/multiprocessing.html#avoiding-and-fighting-deadlocks
You could try amending your dataloader with pin_memory=False and reducing the allocated number of workers (a short sketch follows below). See also https://stackoverflow.com/questions/72183733/databricks-notebook-hanging-with-pytorch/72473053#72473053

Same issue here with version 1.6.4.
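To make the suggested workaround concrete, a minimal sketch (the dataset and the exact batch size / worker count are illustrative, not taken from the thread):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset so the snippet is self-contained; swap in the real Dataset.
train_dataset = TensorDataset(torch.randn(1000, 32))

train_loader = DataLoader(
    train_dataset,
    batch_size=32,     # illustrative value
    num_workers=2,     # try lowering this from the current setting
    pin_memory=False,  # avoids the pinned-memory deadlock described in the PyTorch notes
)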