DDP training randomly stopping
🐛 Bug
Edit: it randomly stops in the middle of a training epoch as well.
After validation ends (100%), the training process randomly stops without any error log. The stopping point changes from run to run (sometimes after epoch 4 validation, sometimes after epoch 1 validation), and every time it happens, one of the GPUs shows 0% utilization while the others stay at 100%. GPU memory remains allocated on all of them.
I have tried adding sync_dist=True in self.log and removed top_k model checkpointing, referencing https://github.com/PyTorchLightning/pytorch-lightning/issues/5865. Following https://github.com/PyTorchLightning/pytorch-lightning/issues/9851, I have also added seed_everything(). I checked that each GPU gets the same number of batches for both training and validation. However, the issue persists.
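For reference, the changes look roughly like this (a sketch: LitNewsRec, _shared_eval_step, and the seed value are placeholders standing in for my actual code):

import pytorch_lightning as pl

# Per #9851: seed everything before building the model / trainer.
pl.seed_everything(42)  # the real seed comes from my config

class LitNewsRec(pl.LightningModule):  # placeholder name for my model
    ...

    def validation_step(self, batch, batch_idx):
        loss, auc = self._shared_eval_step(batch)  # _shared_eval_step is a placeholder helper
        # Per #5865: sync_dist=True all-reduces the logged values across DDP ranks,
        # so every process logs the same metric.
        self.log("val_loss", loss, sync_dist=True)
        self.log("val_auc", auc, sync_dist=True)
        return loss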
Any solution to this problem?

To Reproduce
I was unable to reproduce the issue with the BoringModel, but since the stopping point is irregular even with the same seed passed to pl.seed_everything, I believe it is a bug in the DDP process itself.
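For completeness, the repro attempt followed the standard bug-report template, roughly along these lines (a sketch: the dataset sizes, seed, and gpus=2 are arbitrary and not from my real setup):

import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Synthetic data so the script is self-contained."""

    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).sum()
        self.log("val_loss", loss, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    pl.seed_everything(42)
    trainer = pl.Trainer(
        accelerator="ddp",
        gpus=2,
        max_epochs=5,
        num_sanity_val_steps=0,
    )
    trainer.fit(
        BoringModel(),
        DataLoader(RandomDataset(32, 640), batch_size=2),
        DataLoader(RandomDataset(32, 640), batch_size=2),
    )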
Expected behavior
The training process should continue after validation.
Environment
- PyTorch Lightning Version (e.g., 1.5.0): 1.4.9
- PyTorch Version (e.g., 1.10): 1.9
- Python version (e.g., 3.9): 3.8
- OS (e.g., Linux): Linux
- CUDA/cuDNN version: 11.2
- GPU models and configuration: Google Cloud Platform A100 x8
- How you installed PyTorch (conda, pip, source): pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
- If compiling from source, the output of torch.__config__.show():
- Any other relevant information:
Additional context
Here is my code for the trainer:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath=log_dir,
    filename=cfg.exp_name + "-{epoch}-{val_auc:.3f}",
    every_n_epochs=1,
    save_top_k=-1,  # keep a checkpoint for every epoch
)

trainer = pl.Trainer(
    callbacks=[
        checkpoint_callback,
        LearningRateMonitor(logging_interval="step"),
    ],
    max_epochs=100,
    accelerator="ddp",
    gpus=str(cfg.gpus),
    logger=pl.loggers.WandbLogger(project="news_recommendation", name=cfg.exp_name),
    val_check_interval=cfg[cfg.experiment_type[cfg.current_stage]].val_check_interval,
    limit_train_batches=1.0,
    deterministic=True,
    num_sanity_val_steps=0,
    resume_from_checkpoint=cfg[cfg.experiment_type[cfg.current_stage]].load_ckpt,
)
cc @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7
I faced a similar issue, and it was not related to PyTorch Lightning; in my case it was a deadlock, as explained here: https://pytorch.org/docs/stable/notes/multiprocessing.html#avoiding-and-fighting-deadlocks
You could try amending your dataloader with pin_memory=False and reducing the allocated number of workers (a short sketch follows below). See also https://stackoverflow.com/questions/72183733/databricks-notebook-hanging-with-pytorch/72473053#72473053

Same issue here with version 1.6.4.
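To make the suggested workaround concrete, a minimal sketch (the dataset and the exact batch size / worker count are illustrative, not taken from the thread):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset so the snippet is self-contained; swap in the real Dataset.
train_dataset = TensorDataset(torch.randn(1000, 32))

train_loader = DataLoader(
    train_dataset,
    batch_size=32,     # illustrative value
    num_workers=2,     # try lowering this from the current setting
    pin_memory=False,  # avoids the pinned-memory deadlock described in the PyTorch notes
)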