Training stalls with DDP multi-GPU setup
🐛 Bug
My training/validation step hangs when using DDP on a 4-GPU AWS instance. Usually it happens at the end of the first epoch, but sometimes in the middle of it. The code runs fine on 1 GPU. My model checkpoint is a very basic setup:
```python
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    args.checkpointdir,
    save_last=True)
```
as is the trainer:
```python
trainer = pl.Trainer(
    progress_bar_refresh_rate=1000,
    log_every_n_steps=1000,
    max_epochs=model_config['epochs'],
    gradient_clip_val=0.5,
    gpus=-1,
    accelerator='ddp',
    plugins=[pl.plugins.DDPPlugin(find_unused_parameters=True)],
    callbacks=[checkpoint_callback])
```
I know there is a related issue https://github.com/PyTorchLightning/pytorch-lightning/issues/4612, but in my case the hanging happens non-deterministically.
Funnily enough, if I use a subset of the data via --limit_train_batches, the training runs fine. I also monitor GPU memory usage and it never goes above 91/92%, so it does not look like an out-of-memory problem.
Any suggestions would be most appreciated.
Is there at least a way to induce an error message and a hard failure? For example, on AWS SageMaker a stalled training job does not fail, so it keeps accumulating costs. I do not want to use other parallel backends, as they are much slower and make 4-GPU parallelism cost-ineffective.
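One way to turn a silent stall into a hard failure is to make NCCL collectives respect the process-group timeout instead of blocking forever. The exact knobs depend on the PyTorch version (`NCCL_ASYNC_ERROR_HANDLING` only exists from PyTorch 1.7), so treat this as a sketch rather than a guaranteed fix:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# With blocking wait enabled, a collective that exceeds the process-group
# timeout raises an error instead of hanging the job indefinitely.
os.environ["NCCL_BLOCKING_WAIT"] = "1"
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # PyTorch >= 1.7 only

# Only needed if you initialise the process group yourself (Lightning's DDP
# normally does this for you); assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE
# are set by the launcher. The default timeout is 30 minutes.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```

With the environment variables set, a rank stuck in a collective will error out after the timeout, which at least fails the SageMaker job instead of letting it run up costs.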
Expected behavior
Model runs in multi-GPU DDP mode without stalling.
Environment
Using AWS p3* instances
- CUDA:
  - GPU: Tesla V100-SXM2-16GB
  - available: True
  - version: 10.1
- Packages:
  - numpy: 1.20.1
  - pyTorch_debug: False
  - pyTorch_version: 1.4.0 (also tried 1.6.0)
  - pytorch-lightning: 1.2.3
  - tqdm: 4.57.0
- System:
  - OS: Linux
  - architecture: 64bit, ELF
  - processor: x86_64
  - python: 3.8.5
  - version: #40-Ubuntu SMP Fri Feb 5 23:50:40 UTC 2021
I’ve also faced a problem with code stuck on “initializing ddp”. After much work, I solved my problem by simply adding
num_sanity_val_steps=0
as an argument toTrainer(...)
Same issue, it doesn’t even start for me.
This is my hardware (2 GPUs on one machine):
My code (the MNIST example from this tutorial):
However, when running this in a Jupyter notebook, all I get is:
UPDATE: the following work:
But