DDP training randomly stopping

🐛 Bug

Edit: it also randomly stops in the middle of a training epoch.

After validation ends (100%), the training process randomly stops without any error log. The stopping point changes from run to run (sometimes after epoch 4 validation, sometimes after epoch 1 validation), and whenever this happens one of the GPUs shows 0% utilization while the others stay at 100%. Memory remains allocated on all GPUs.

I have tried adding sync_dist=True in self.log and removed top_k model-checkpoint saving, referencing https://github.com/PyTorchLightning/pytorch-lightning/issues/5865. Following https://github.com/PyTorchLightning/pytorch-lightning/issues/9851, I also added seed_everything(). I checked that each GPU gets the same number of batches for both training and validation. However, the issue persists.
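
For reference, the two mitigations look roughly like this in code (the module below is a simplified stand-in, not my actual model; only the sync_dist=True and seed_everything calls match what I actually tried):

    import torch
    import pytorch_lightning as pl

    pl.seed_everything(42, workers=True)  # identical seed on every DDP rank

    class ToyModel(pl.LightningModule):
        """Hypothetical stand-in for the real model."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 1)

        def training_step(self, batch, batch_idx):
            return self.layer(batch).mean()

        def validation_step(self, batch, batch_idx):
            val_metric = self.layer(batch).mean()
            # sync_dist=True reduces the value across ranks before logging,
            # so every rank takes part in the same collective call
            self.log("val_auc", val_metric, sync_dist=True)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)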

Any solution to this problem?

[Screenshots: 2021-12-23, 9:31 PM]

To Reproduce

I was unable to reproduce this with the BoringModel, but since the stopping point is irregular even with the same seed passed to pl.seed_everything, I believe it is a bug in the DDP process itself.
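
The repro attempt was along these lines (a self-contained stand-in for the BoringModel with random data and made-up sizes, not the exact script I ran):

    import torch
    from torch.utils.data import DataLoader, Dataset
    import pytorch_lightning as pl

    class RandomDataset(Dataset):
        """Random tensors so the script has no external data dependency."""

        def __len__(self):
            return 64

        def __getitem__(self, idx):
            return torch.randn(32)

    class BoringishModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            return self.layer(batch).sum()

        def validation_step(self, batch, batch_idx):
            self.log("val_loss", self.layer(batch).sum(), sync_dist=True)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)

    if __name__ == "__main__":
        pl.seed_everything(42)
        model = BoringishModel()
        trainer = pl.Trainer(
            gpus=2,                 # any multi-GPU count triggers DDP here
            accelerator="ddp",      # PL 1.4.x flag; newer versions use strategy="ddp"
            max_epochs=5,
            num_sanity_val_steps=0,
        )
        trainer.fit(
            model,
            DataLoader(RandomDataset(), batch_size=8),
            DataLoader(RandomDataset(), batch_size=8),
        )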

Expected behavior

The training process should continue after validation.

Environment

  • PyTorch Lightning Version (e.g., 1.5.0): 1.4.9
  • PyTorch Version (e.g., 1.10): 1.9
  • Python version (e.g., 3.9): 3.8
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: Google Cloud Platform A100 x8
  • How you installed PyTorch (conda, pip, source): pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Additional context

Here is my code for the trainer:

    # cfg and log_dir are defined elsewhere in the script
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint

    checkpoint_callback = ModelCheckpoint(
        dirpath=log_dir,
        filename=cfg.exp_name + "-{epoch}-{val_auc:.3f}",
        every_n_epochs=1,
        save_top_k=-1,  # keep every checkpoint
    )

    trainer = pl.Trainer(
        callbacks=[
            checkpoint_callback,
            LearningRateMonitor(logging_interval="step"),
        ],
        max_epochs=100,
        accelerator="ddp",
        gpus=str(cfg.gpus),
        logger=pl.loggers.WandbLogger(project="news_recommendation", name=cfg.exp_name),
        val_check_interval=cfg[cfg.experiment_type[cfg.current_stage]].val_check_interval,
        limit_train_batches=1.0,
        deterministic=True,
        num_sanity_val_steps=0,
        resume_from_checkpoint=cfg[cfg.experiment_type[cfg.current_stage]].load_ckpt,
    )
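
One general-purpose way to narrow down a hang like this (not something tried in the original report) is to periodically dump each rank's Python stack with the standard-library faulthandler; comparing the dumps shows which rank is blocked inside a collective the others have already left. The 20-minute interval below is an arbitrary choice:

    import faulthandler
    import sys

    # Call this early on every rank. If the process is still alive after
    # `timeout` seconds, the stacks of all its threads are written to stderr,
    # and the timer re-arms because repeat=True.
    faulthandler.dump_traceback_later(timeout=1200, repeat=True, file=sys.stderr)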

cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 6
  • Comments: 25 (6 by maintainers)

Top GitHub Comments

1 reaction
kayvane1 commented, Dec 7, 2022

I faced a similar issue and it was not related to PyTorch Lightning; in my case it was a deadlock, as explained here: https://pytorch.org/docs/stable/notes/multiprocessing.html#avoiding-and-fighting-deadlocks

You could try amending your dataloader with pin_memory=False and reducing the number of workers. https://stackoverflow.com/questions/72183733/databricks-notebook-hanging-with-pytorch/72473053#72473053
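
Concretely that change looks something like this (the dataset here is a stand-in so the snippet runs on its own; only pin_memory=False and the smaller num_workers are the point):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in dataset; the relevant arguments are pin_memory=False and a
    # small num_workers, which avoid the worker/fork deadlocks described in
    # the PyTorch multiprocessing notes linked above.
    train_dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))

    train_loader = DataLoader(
        train_dataset,
        batch_size=32,
        shuffle=True,
        num_workers=2,      # reduced from a larger worker count
        pin_memory=False,
    )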

1 reaction
HareshKarnan commented, Jul 7, 2022

Same issue here with version 1.6.4
