
stuck at "Initializing distributed.." when using ddp with multiple gpus

See original GitHub issue

First check

  • I’m sure this is a bug.
  • I’ve added a descriptive title to this bug.
  • I’ve provided clear instructions on how to reproduce the bug.
  • I’ve added a code sample.
  • I’ve provided any other important info that is required.

Bug description

Dear community,

I’m desperately trying to achieve multi-GPU training on our scientific SLURM cluster. It has one GPU (Tesla T4) per node, so specifically I want to achieve multi-node, multi-GPU training. For testing I just use a minimal example from the PL documentation (https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction.html) to ensure that the error does not come from my own model. Even when allocating only 1 GPU, the code hangs. When using “dp” instead of “ddp” it runs (though not really faster).

How to reproduce the bug

MODEL:
import os
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=4)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=4)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=4)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="gpu",
        devices=1,
        strategy="ddp",
        num_nodes=4
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()


SLURM SUBMISSION SCRIPT:
#!/bin/bash
#SBATCH -J test
#SBATCH -e test.err
#SBATCH -o test.log
#SBATCH --tasks-per-node 8
#SBATCH --gres gpu:1
#SBATCH --nodes 4

SECONDS=0
python train.py
echo "$SECONDS seconds passed"

Error messages and logs

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1)` was configured so 1 batch per epoch will be used.
`Trainer(limit_val_batches=1)` was configured so 1 batch will be used.
`Trainer(limit_test_batches=1)` was configured so 1 batch will be used.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4

Important info

cuda 11.2
torch 1.12.1
pytorch-lightning 1.7.6

More info

I tried out various things from many threads but still can’t even get this minimal example to work 😦

This is my first bug report ever, please don’t hate and thanks a lot in advance!

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
awaelchli commented, Sep 20, 2022

Hey @FlorianWieser1, our SLURM docs here have a template for the SLURM submission script. One thing which is very important is that the number of nodes and processes configured there needs to match what is in the Trainer! This mismatch is the most likely cause of the hang. Try changing it:

#!/bin/bash
#SBATCH -J test
#SBATCH -e test.err
#SBATCH -o test.log
#SBATCH --nodes=4   <--- MUST MATCH num_nodes IN TRAINER
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1   <--- MUST MATCH devices IN TRAINER

Trainer(accelerator="gpu", devices=1, strategy="ddp", num_nodes=4)

Also, please pay close attention to how the script is invoked; it should be done with srun. You have

python train.py

But it should be

srun python train.py

I will try to make this clearer in the docs. I also have a related proposal open: #10150
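
For reference, a complete submission script with both fixes applied might look like the sketch below. It is only a sketch that reuses the job name, log files, timing lines, and train.py entry point from the report above, with one task (one GPU) per node across four nodes to match Trainer(devices=1, num_nodes=4):

#!/bin/bash
#SBATCH -J test
#SBATCH -e test.err
#SBATCH -o test.log
# --nodes must match num_nodes=4 in the Trainer; --ntasks-per-node must match devices=1
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1

SECONDS=0
# srun launches one process per task so the DDP ranks can rendezvous
srun python train.py
echo "$SECONDS seconds passed"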

1 reaction
FlorianWieser1 commented, Sep 26, 2022

Haha, I see, that’s a problem 😂

I don’t know, to be honest. I’ve been on the documentation page you linked multiple times, but I still overlooked or forgot about “srun”. I’m not new to SLURM, so I knew about “srun”. I guess the documentation is fine and this was really my fault 😃 Before I came here I randomly searched the internet for things like “slurm pytorch lightning ddp multi node” etc.

Best, Florian
