stuck at "Initializing distributed.." when using ddp with multiple gpus
First check
- I’m sure this is a bug.
- I’ve added a descriptive title to this bug.
- I’ve provided clear instructions on how to reproduce the bug.
- I’ve added a code sample.
- I’ve provided any other important info that is required.
Bug description
Dear community,
I’m desperately trying to get multi-GPU training working on our scientific SLURM cluster. It has one GPU (Tesla T4) per node, so specifically I want to achieve multi-node, multi-GPU training. For testing I just use a minimal example from the PL documentation (https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction.html) to make sure the error is not built into my model. Even when allocating only 1 GPU the code hangs. When using “dp” instead of “ddp” it runs (though not really faster).
How to reproduce the bug
MODEL:
import os

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=4)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=4)
    test_data = DataLoader(RandomDataset(32, 64), batch_size=2, num_workers=4)

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        limit_train_batches=1,
        limit_val_batches=1,
        limit_test_batches=1,
        num_sanity_val_steps=0,
        max_epochs=1,
        enable_model_summary=False,
        accelerator="gpu",
        devices=1,
        strategy="ddp",
        num_nodes=4,
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)
    trainer.test(model, dataloaders=test_data)


if __name__ == "__main__":
    run()
SLURM SUBMISSION SCRIPT:
#!/bin/bash
#SBATCH -J test
#SBATCH -e test.err
#SBATCH -o test.log
#SBATCH --tasks-per-node 8
#SBATCH --gres gpu:1
#SBATCH --nodes 4
SECONDS=0
python train.py
echo "$SECONDS seconds passed"
Error messages and logs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(limit_train_batches=1)` was configured so 1 batch per epoch will be used.
`Trainer(limit_val_batches=1)` was configured so 1 batch will be used.
`Trainer(limit_test_batches=1)` was configured so 1 batch will be used.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Important info
cuda 11.2
torch 1.12.1
pytorch-lightning 1.7.6
More info
I tried out various things from many threads but still can’t even get this minimal example to work 😦
This is my first bug report ever, please don’t hate and thanks a lot in advance!
Hey @FlorianWieser1! Our SLURM docs have a template for the SLURM submission script. One thing that is very important: the number of nodes and processes configured there needs to match what is in the Trainer! This mismatch is the most likely reason it gets stuck. Try to change it so the two sides agree.
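Roughly, for the Trainer settings above (devices=1 per node, num_nodes=4), a matching submission script could look like the sketch below; site-specific options such as partition or time limits are omitted and need to be adjusted for your cluster:

#!/bin/bash
#SBATCH -J test
#SBATCH -e test.err
#SBATCH -o test.log
#SBATCH --nodes=4              # must equal Trainer(num_nodes=4)
#SBATCH --ntasks-per-node=1    # must equal Trainer(devices=1), i.e. one task per GPU
#SBATCH --gres=gpu:1           # one GPU per node

srun python train.py           # srun starts one process per task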
Also, please pay close attention to how you invoke the script: it should be done with `srun`. You have `python train.py`, but it should be `srun python train.py`.
I will try to make this clearer in the docs. I also have a related proposal open: #10150
Haha, I see, that’s a problem 😂
To be honest, I don’t know; I’ve been on the documentation page you linked multiple times, but I still overlooked or forgot about “srun”. I’m not new to SLURM, so I knew about “srun”. I guess the documentation is fine and this was really my fault 😃 Before I came here I randomly searched the internet for things like “slurm pytorch lightning ddp multi node” etc.
Best, Florian