Training stalls with DDP multi-GPU setup

See the original GitHub issue (PyTorchLightning/pytorch-lightning#6569)

🐛 Bug

My training / validation step hangs when using ddp on a 4-GPU AWS instance. Usually this happens at the end of the first epoch, but sometimes in the middle of it. The code runs fine on 1 GPU. My model checkpoint is a very basic setup:

checkpoint_callback = pl.callbacks.ModelCheckpoint(
    args.checkpointdir,
    save_last=True)

as is the trainer:

trainer = pl.Trainer(
    progress_bar_refresh_rate=1000,
    log_every_n_steps=1000,
    max_epochs=model_config['epochs'],
    gradient_clip_val=0.5,
    gpus=-1,
    accelerator='ddp',
    plugins=[pl.plugins.DDPPlugin(find_unused_parameters=True)],
    callbacks=[checkpoint_callback])

I know there is a related issue https://github.com/PyTorchLightning/pytorch-lightning/issues/4612, but in my case the hanging happens non-deterministically.

Oddly, if I use a subset of the data via --limit_train_batches, training runs fine. I also monitor GPU memory usage, and it never goes above 91-92%.
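
For context, a minimal sketch of how --limit_train_batches might be wired into the Trainer; the argparse plumbing here is an assumption, and only limit_train_batches itself (a standard Trainer argument that accepts a fraction or a batch count) is taken from the report:

# Hedged sketch: reproduce the "subset of the data" run from the report.
# The command-line wiring is hypothetical.
import argparse
import pytorch_lightning as pl

parser = argparse.ArgumentParser()
parser.add_argument("--limit_train_batches", type=float, default=1.0)
args = parser.parse_args()

trainer = pl.Trainer(
    gpus=-1,
    accelerator="ddp",
    limit_train_batches=args.limit_train_batches,  # e.g. 0.1 uses 10% of the training batches
)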

Any suggestions would be most appreciated.

Is there at least a way to induce an error message and make the job fail? On AWS SageMaker, a stalled model does not fail the job, so it just keeps accumulating costs. I do not want to use other parallel backends, since they are much slower and make 4-GPU parallelism cost-ineffective.
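
(Editorial aside, not from the thread: one way to turn a silent NCCL stall into a hard failure is the NCCL_BLOCKING_WAIT environment variable, which makes collectives honour the process-group timeout and raise when it expires. Whether it surfaces this particular hang is an assumption.)

# Hedged sketch: fail fast instead of stalling forever on a hung collective.
# NCCL_BLOCKING_WAIT is a documented PyTorch/NCCL setting; the process-group
# timeout defaults to 30 minutes and is fixed at init_process_group time, which
# Lightning handles internally, so the environment variable is the least
# invasive knob here.
import os

os.environ["NCCL_BLOCKING_WAIT"] = "1"  # must be set before DDP initialises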

Expected behavior

Model trains in multi-GPU DDP mode without stalling.

Environment

Using AWS p3* instances

  • CUDA:
    • GPU:
      • Tesla V100-SXM2-16GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.20.1
    • pyTorch_debug: False
    • pyTorch_version: 1.4.0 (also tried 1.6.0)
    • pytorch-lightning: 1.2.3
    • tqdm: 4.57.0
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor: x86_64
    • python: 3.8.5
    • version: #40-Ubuntu SMP Fri Feb 5 23:50:40 UTC 2021

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

3 reactions
Rilwan-A commented, Mar 21, 2021

I’ve also faced a problem with code stuck on “initializing ddp”. After much work, I solved it by simply adding num_sanity_val_steps=0 as an argument to Trainer(...), as in the sketch below.
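
A minimal sketch of that workaround; every Trainer argument other than num_sanity_val_steps is a placeholder, not taken from the comment:

# Skip the pre-training validation sanity check, which is where the hang occurred.
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus=-1,
    accelerator="ddp",
    num_sanity_val_steps=0,  # disable the sanity validation loop before training starts
)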

1 reaction
NielsRogge commented, Mar 19, 2021

Same issue, it doesn’t even start for me.

This is my hardware (2 GPUs on one machine):

!nvidia-smi
Thu Mar 18 13:43:52 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
|  0%   59C    P8    27W / 260W |   2370MiB / 11018MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
|  0%   58C    P8    23W / 260W |      4MiB / 11019MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9630      C   ...env_datascouts/bin/python     2281MiB |
|    0   N/A  N/A     28915      G   /usr/lib/xorg/Xorg                 27MiB |
|    0   N/A  N/A     28947      G   /usr/bin/gnome-shell               56MiB |
+-----------------------------------------------------------------------------+

My code (the MNIST example from this tutorial):

from pathlib import Path
import requests
import pickle
import gzip

import math
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader  # TensorDataset is needed for train_ds below

DATA_PATH = Path("data")
PATH = DATA_PATH / "mnist"

PATH.mkdir(parents=True, exist_ok=True)

URL = "https://github.com/pytorch/tutorials/raw/master/_static/"
FILENAME = "mnist.pkl.gz"

if not (PATH / FILENAME).exists():
        content = requests.get(URL + FILENAME).content
        (PATH / FILENAME).open("wb").write(content)

with gzip.open((PATH / FILENAME).as_posix(), "rb") as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding="latin-1")

x_train, y_train, x_valid, y_valid = map(
    torch.tensor, (x_train, y_train, x_valid, y_valid)
)
n, c = x_train.shape
x_train, x_train.shape, y_train.min(), y_train.max()
print(x_train, y_train)
print(x_train.shape)
print(y_train.min(), y_train.max())

train_ds = TensorDataset(x_train, y_train)

class MNISTModel(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10))

    def forward(self, xb):
        return xb @ self.weights + self.bias

    def training_step(self, batch, batch_nb):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

# Init our model
mnist_model = MNISTModel()

train_loader = DataLoader(train_ds, batch_size=32)

# Initialize a trainer
trainer = pl.Trainer(gpus=[0,1], accelerator='ddp', max_epochs=3, progress_bar_refresh_rate=20)

# Train the model ⚡
trainer.fit(mnist_model, train_loader)

However, when running this in a Jupyter notebook, all I get is:

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Missing logger folder: /home/(...)/lightning_logs
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2

UPDATE: the following work (see the sketch below):

  • gpus=1
  • gpus=2, accelerator="dp"

But:

  • gpus=2, accelerator="ddp" does not work.
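
For reference, those three configurations as Trainer calls; everything other than gpus and accelerator (max_epochs comes from the code above) is a placeholder:

import pytorch_lightning as pl

# Works: single GPU
trainer = pl.Trainer(gpus=1, max_epochs=3)

# Works: two GPUs with DataParallel
trainer = pl.Trainer(gpus=2, accelerator="dp", max_epochs=3)

# Hangs at "initializing ddp" when launched from the notebook
trainer = pl.Trainer(gpus=2, accelerator="ddp", max_epochs=3)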
