
DeepSpeed with SLURM

See original GitHub issue

Hi, I am trying to run DeepSpeed on SLURM with multiple nodes and multiple GPUs on each node. I was referring to this example here. I am not sure how to specify the addresses of all the machines that will be allocated to me when I submit a SLURM script.

My task.sbatch file looks like this:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8

deepspeed --num_gpus 8 --num_nodes 2 finetune.py --config conf/tutorial-gpt2-micro.yaml

I have two questions:

  1. How do I find out the addresses of the machines when specifying the deepspeed parameters?
  2. Why do we need to set the learning rate, scheduler, etc. in the config files when the Trainer in the code already has those parameters set for single-machine multi-GPU training through Hugging Face transformers?

I am new to SLURM-based systems, so apologies if there is an easy way to specify the addresses. If there is a SLURM-based example of DeepSpeed, please point me to it.

Thanks in advance!

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

stas00 commented on Jun 16, 2022 (4 reactions)

I couldn’t make it work with the deepspeed launcher, but it works just fine with torch’s launcher; please see how we launch it here:

https://github.com/bigscience-workshop/bigscience/blob/7ccf7e42577fe71e88cf8bed3b9ca965c7afb8f7/train/tr11-176B-ml/tr11-176B-ml.slurm#L172-L217
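The linked script is long, but the core pattern can be sketched in a few lines. Here is a minimal sketch, assuming torchrun (torch’s distributed launcher) with one launcher task per node; the script name, config path, GPU count, and port are taken from the question above or are otherwise illustrative:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

# Use the first host in the SLURM allocation as the rendezvous endpoint.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000

# srun starts one torchrun per node; each torchrun spawns one worker per GPU.
srun torchrun \
  --nnodes "$SLURM_NNODES" \
  --nproc_per_node 8 \
  --rdzv_backend c10d \
  --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
  finetune.py --config conf/tutorial-gpt2-micro.yaml

With the c10d rendezvous backend the workers negotiate ranks among themselves, so machine addresses never have to be listed by hand: SLURM’s SLURM_JOB_NODELIST variable supplies everything needed.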

thechargedneutron commented on Jun 16, 2022 (2 reactions)

@stas00 Thanks! Torch’s launcher is working perfectly fine! I can now run multi-GPU, multi-node code with DeepSpeed optimization (feeling powerful!). Thanks @mrwyattii @jeffra
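For completeness, the deepspeed launcher route that the question originally asked about discovers machines through a hostfile rather than through SLURM. A minimal sketch of generating one from the allocation, assuming the launcher’s default ssh-based runner; the hostfile path and slot count are illustrative:

# Build a DeepSpeed hostfile ("hostname slots=N" per line) from the SLURM allocation.
scontrol show hostnames "$SLURM_JOB_NODELIST" \
  | awk '{ print $1, "slots=8" }' > /tmp/hostfile

deepspeed --hostfile /tmp/hostfile finetune.py --config conf/tutorial-gpt2-micro.yaml

Note that the deepspeed launcher reaches the other nodes over ssh (or pdsh) by default, and passwordless ssh between compute nodes is often disabled on SLURM clusters, which is one likely reason the torch launcher succeeded here where the deepspeed launcher did not.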


Top Results From Across the Web

  • CUDA_VISIBLE_DEVICES isn't correctly inherited on ... (GitHub) — a bug report about CUDA_VISIBLE_DEVICES on a SLURM cluster, including an attempt to use the deepspeed --include flag.
  • Install Determined on Slurm/PBS — describes how to deploy Determined on an HPC cluster managed by the Slurm or PBS workload managers.
  • Training On Multiple Nodes With DeepSpeed — a tutorial on using DeepSpeed for training with multiple GPUs on one node or on many nodes.
  • Getting Started - DeepSpeed — DeepSpeed model training is accomplished using the DeepSpeed engine, which can wrap any arbitrary model of type torch.nn.Module.
  • Multi-GPU and multi-node machine learning (CSC Docs) — DeepSpeed is an optimization software suite for PyTorch that helps in scaling both training and inference for large deep learning models.
