
DeepSpeed with SLURM

See original GitHub issue

Hi, I am trying to run DeepSpeed on SLURM with multiple nodes and multiple GPUs on each node. I was referring to this example here. I am not sure how to specify the addresses of all the machines that will be allocated to me when I submit a SLURM script.

My task.sbatch file looks like this:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8

deepspeed --num_gpus 8 --num_nodes 2 finetune.py --config conf/tutorial-gpt2-micro.yaml

I have two questions:

  1. How do I find out the addresses of the machines when specifying the deepspeed parameters?
  2. Why do we need to set the learning rate, scheduler, etc. in the config files when the Trainer in the code already has those parameters set for single-machine multi-GPU training through Hugging Face transformers?

I am new to SLURM-based systems, so apologies if there is an easy way to specify the addresses. If there is a SLURM-based example of DeepSpeed, please point me to it.

Thanks in advance!

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

stas00 commented on Jun 16, 2022 (4 reactions)

I couldn’t make it work with the deepspeed launcher, but it works just fine with torch’s launcher; please see how we launch it here:

https://github.com/bigscience-workshop/bigscience/blob/7ccf7e42577fe71e88cf8bed3b9ca965c7afb8f7/train/tr11-176B-ml/tr11-176B-ml.slurm#L172-L217
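The linked script is long, but the core pattern can be sketched in a few lines. Here is a minimal sketch, assuming torchrun (torch’s distributed launcher) with one launcher task per node; the script name, config path, GPU count, and port are taken from the question above or are otherwise illustrative:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

# Use the first host in the SLURM allocation as the rendezvous endpoint.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000

# srun starts one torchrun per node; each torchrun spawns one worker per GPU.
srun torchrun \
  --nnodes "$SLURM_NNODES" \
  --nproc_per_node 8 \
  --rdzv_backend c10d \
  --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
  finetune.py --config conf/tutorial-gpt2-micro.yaml

With the c10d rendezvous backend the workers negotiate ranks among themselves, so machine addresses never have to be listed by hand: SLURM’s SLURM_JOB_NODELIST variable supplies everything needed.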

thechargedneutron commented on Jun 16, 2022 (2 reactions)

@stas00 Thanks! Torch’s launcher is working perfectly fine! I can now run multi-GPU, multi-node code with DeepSpeed optimization (feeling powerful!). Thanks @mrwyattii @jeffra
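For completeness, the deepspeed launcher route that the question originally asked about discovers machines through a hostfile rather than through SLURM. A minimal sketch of generating one from the allocation, assuming the launcher’s default ssh-based runner; the hostfile path and slot count are illustrative:

# Build a DeepSpeed hostfile ("hostname slots=N" per line) from the SLURM allocation.
scontrol show hostnames "$SLURM_JOB_NODELIST" \
  | awk '{ print $1, "slots=8" }' > /tmp/hostfile

deepspeed --hostfile /tmp/hostfile finetune.py --config conf/tutorial-gpt2-micro.yaml

Note that the deepspeed launcher reaches the other nodes over ssh (or pdsh) by default, and passwordless ssh between compute nodes is often disabled on SLURM clusters, which is one likely reason the torch launcher succeeded here where the deepspeed launcher did not.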


Top Results From Across the Web

  • CUDA_VISIBLE_DEVICES isn't correctly inherited on ... (GitHub) — a bug report about CUDA_VISIBLE_DEVICES on a SLURM cluster, including an attempt to use the deepspeed --include flag.
  • Install Determined on Slurm/PBS — describes how to deploy Determined on an HPC cluster managed by the Slurm or PBS workload managers.
  • Training On Multiple Nodes With DeepSpeed — a tutorial on using DeepSpeed for training with multiple GPUs on one node or on many nodes.
  • Getting Started - DeepSpeed — DeepSpeed model training is accomplished using the DeepSpeed engine, which can wrap any arbitrary model of type torch.nn.Module.
  • Multi-GPU and multi-node machine learning (CSC Docs) — DeepSpeed is an optimization software suite for PyTorch that helps in scaling both training and inference for large deep learning models.
