Training stuck running on a SLURM cluster with multiple GPUs per node
🐛 Bug
I am trying to train a model across multiple nodes on a SLURM cluster, where each node has two GPUs. Therefore, I use the following flags in the trainer:
trainer = pl.Trainer(
    gpus=2, num_nodes=2,
    accelerator='ddp',
    max_epochs=2
)
and submit the job with sbatch run_training.sh. However, I end up with the following output, and nothing further happens:
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
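Reading the output above, each node appears to initialize only ranks 0 and 1 of a world size of 4, so the processes then wait indefinitely for ranks 2 and 3. A quick way to check how many tasks srun actually launches, and which rank SLURM assigns to each one, is to print the standard SLURM task variables from inside the allocation (SLURM_NODEID, SLURM_PROCID and SLURM_NTASKS are set by srun for every task), for example:

srun bash -c 'echo "node=$(hostname) nodeid=$SLURM_NODEID procid=$SLURM_PROCID ntasks=$SLURM_NTASKS"'

With the submission script shown below, which requests two nodes but does not set a task count, this would likely report only two tasks in total rather than the four processes that gpus=2, num_nodes=2 expects.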
Are there any other flags I am missing? Thanks for any help. Below you can find the contents of the files used above.
run_training.sh
#!/bin/bash
#SBATCH -o slurm_outfiles/autoencoder-%j-%A-%a.out
#SBATCH -N 2
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G
srun python torch_ddp_toy.py
torch_ddp_toy.py
import pytorch_lightning as pl
import torch
from torch import nn


class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()
    datasets = [torch.rand([5]) for __ in range(100)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=8)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=1)
    trainer = pl.Trainer(
        gpus=2, num_nodes=2,
        accelerator='ddp',
        max_epochs=2
    )
    trainer.fit(m, train_loader, val_loader)
- PyTorch version 1.7.1
- PyTorch Lightning version 1.2.0
- CentOS Linux release 8.1.1911
- PyTorch installed via conda
- PyTorch Lightning via pip
- SLURM 20.02.3
UPDATE: added the PyTorch Lightning version
Removing the num_nodes argument from the training configuration solved the same problem for me.

This is what resolved it for me as well: the SLURM allocation variables matter here, at least on the SLURM version my institution is using, and changing them in my submission script fixed the hang. Hope this helps 😄
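For reference, since the exact before/after allocation script is not reproduced above, here is a minimal sketch of the kind of change that resolves this hang, assuming the usual requirement that the number of SLURM tasks per node matches the gpus= value passed to the Trainer (here: two tasks per node on two nodes):

#!/bin/bash
#SBATCH -o slurm_outfiles/autoencoder-%j-%A-%a.out
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH -c 20
#SBATCH -t 24:00:00
#SBATCH --mem 60G
# --ntasks-per-node=2 requests one task per GPU (four ranks in total);
# -c 20 splits the original 40 CPUs per node across the two tasks.
srun python torch_ddp_toy.py

With an allocation like this, srun starts four processes (two per node) and Lightning can derive each global rank from the SLURM task variables, so all four members should register and DDP initialization can complete.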