
Training stuck running on the SLURM cluster with multiple GPUs per node

See original GitHub issue

🐛 Bug

I am trying to train a model across multiple nodes on a SLURM cluster, where each node has two GPUs. Therefore, I use the following flags in the Trainer:

trainer = pl.Trainer(
      gpus=2, num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )

and submit the job with sbatch run_training.sh. However, I end up with the following output, and nothing happens after that:

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4

Are there any other flags I am missing? Thanks for any help. Below you can find the contents of the files used above.

run_training.sh

#!/bin/bash
#SBATCH -o slurm_outfiles/autoencoder-%j-%A-%a.out
#SBATCH -N 2
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G

srun python torch_ddp_toy.py

torch_ddp_toy.py

import pytorch_lightning as pl
import torch
from torch import nn

class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()

    datasets = [torch.rand([5]) for __ in range(100)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=8)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=1)

    trainer = pl.Trainer(
      gpus=2, num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )
    trainer.fit(m, train_loader, val_loader)

Environment

  • PyTorch version 1.7.1
  • PyTorch Lightning version 1.2.0
  • CentOS Linux release 8.1.1911
  • PyTorch installed via conda
  • PyTorch Lightning via pip
  • slurm 20.02.3

UPDATE: added the PyTorch Lightning version.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 24 (10 by maintainers)

Top GitHub Comments

2 reactions
hkmztrk commented, Mar 1, 2021

Removing the num_nodes argument from the training configuration solved the same problem for me.
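
For reference, a minimal sketch of that change against the toy script above (the other Trainer arguments are kept as in the original; note that num_nodes then falls back to its default of 1):

import pytorch_lightning as pl

# Sketch of the change described above: the Trainer from torch_ddp_toy.py
# without the num_nodes argument (it then defaults to 1).
trainer = pl.Trainer(
    gpus=2,
    accelerator='ddp',
    max_epochs=2
)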

1 reaction
haideraltahan commented, Jun 18, 2021

It requires the following SLURM environment variables to be detected:

SLURM_JOB_ID
SLURM_PROCID
SLURM_LOCALID
SLURM_NODEID
SLURM_NTASKS

SLURM_NTASKS must match num_nodes * gpus in the Trainer.

This is what resolved the problem for me; these variables are important for it to work, at least on the SLURM version my institution is using. Here is the change to my allocation script that fixed it:

#SBATCH --tasks-per-node=4
#SBATCH --mem 185G
#SBATCH --cpus-per-task=8
#SBATCH --job-name=train
#SBATCH -o slurm.%x.%j.out
#SBATCH --gres=gpu:v100l:4
#SBATCH --time=44:00:00

Before it was:

#SBATCH --mem 185G
#SBATCH -c 32
#SBATCH --job-name=train
#SBATCH -o slurm.%x.%j.out
#SBATCH --gres=gpu:v100l:4
#SBATCH --time=44:00:00

Hope this helps 😄
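
As a side note: in the original report gpus=2 and num_nodes=2, so SLURM would need to launch 2 × 2 = 4 tasks in total (for example via #SBATCH --ntasks-per-node=2 on 2 nodes). One hedged way to check what each task actually sees is a small probe like the following (not from the original thread; launch it with srun inside the same sbatch allocation as the training job):

import os

# Print the SLURM variables that Lightning's SLURM detection relies on,
# one line per srun task, so a mismatched allocation is easy to spot.
slurm_vars = (
    "SLURM_JOB_ID",
    "SLURM_PROCID",
    "SLURM_LOCALID",
    "SLURM_NODEID",
    "SLURM_NTASKS",
)
print(" ".join(f"{var}={os.environ.get(var, '<unset>')}" for var in slurm_vars))

If the SLURM_NTASKS printed here does not match num_nodes * gpus in the Trainer, a hang at ddp initialization like the one in the original report is the likely symptom.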

Read more comments on GitHub >

Top Results From Across the Web

Training stuck running on the SLURM cluster with multiple ...
If I use one gpu per node it looks as expected with registering every member. It seems that the problem is if one...

Run on an on-prem cluster (advanced) - PyTorch Lightning
Run on an on-prem cluster (advanced). Run on a SLURM managed cluster. Lightning automates the details behind training on a SLURM-powered cluster.

Training model on distributed mode with slurm. getting ...
I am running the librispeech recipe in distributed mode using slurm on espnet2. I am running on two Oracle instances, each one has a single...

PyTorch - High Performance Computing Facility - UMBC
PyTorch is a GPU/CPU enabled neural network library written in C with native bindings to Python. This tutorial intends to teach you how...

Frequently Asked Questions - Slurm Workload Manager
Why is my MPICH2 or MVAPICH2 job not running with Slurm? ... How can I control the execution of multiple jobs per node?...
