DeepSpeed with SLURM
See original GitHub issue

Hi, I am trying to run DeepSpeed on SLURM with multiple nodes and multiple GPUs on each node. I was referring to this example here. I am not sure how to specify the addresses of all the machines that will be allotted to me when submitting a SLURM script.
My task.sbatch file looks like this:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8

deepspeed --num_gpus 8 --num_nodes 2 finetune.py --config conf/tutorial-gpt2-micro.yaml
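For reference, my understanding is that the deepspeed launcher learns the machine addresses from a hostfile that lists each node with a slot count, so inside the job I would need something like the sketch below (the hostfile path and the slots value are my assumptions; scontrol show hostnames is standard SLURM):

# Sketch: build a DeepSpeed hostfile from the nodes SLURM allots to this job.
# Note: DeepSpeed's default multi-node launcher uses pdsh/ssh between nodes,
# which many SLURM clusters do not permit.
HOSTFILE="hostfile.$SLURM_JOB_ID"
scontrol show hostnames "$SLURM_JOB_NODELIST" | while read -r node; do
    echo "$node slots=8" >> "$HOSTFILE"    # 8 matches --gpus-per-node above
done

deepspeed --hostfile "$HOSTFILE" finetune.py --config conf/tutorial-gpt2-micro.yaml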
I have two questions:
- How do I know the addresses of the machines when specifying the deepspeed parameters?
- Why do we need to set the learning rate, scheduler, etc. in the DeepSpeed config file when the Trainer in the code already has those parameters set for single-machine multi-GPU training through Hugging Face Transformers? (See the sketch after this list.)
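For the second question, here is a sketch of the kind of duplication I mean, written as a minimal config. I have read that the Hugging Face Transformers DeepSpeed integration accepts "auto" placeholders so the Trainer injects its own values instead of them being repeated; whether this repo's finetune.py wires that through is my assumption:

# Sketch: a minimal DeepSpeed config whose "auto" values defer to the
# Hugging Face Trainer's own lr/scheduler/batch-size settings.
cat > ds_config.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": "auto",
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": "auto", "weight_decay": "auto" }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  }
}
EOF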
I am a beginner with SLURM-based systems, so apologies if there is an easy way to specify the addresses that I have missed. If there is a SLURM-based example of DeepSpeed, please point me to it.

Thanks in advance!
Issue Analytics
- State:
- Created: a year ago
- Comments: 9 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I couldn't make it work with the deepspeed launcher, but it's working just fine with torch's launcher; please see how we launch it here: https://github.com/bigscience-workshop/bigscience/blob/7ccf7e42577fe71e88cf8bed3b9ca965c7afb8f7/train/tr11-176B-ml/tr11-176B-ml.slurm#L172-L217
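For readers who do not want to dig through that file, below is a condensed sketch of the pattern it uses, assuming the 2-node, 8-GPU setup from the question; the flags follow torch.distributed.run and standard SLURM, and the training command is copied from the question above:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1          # one launcher task per node

# The first node in the allocation becomes the rendezvous host.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000

LAUNCHER="python -u -m torch.distributed.run \
    --nproc_per_node 8 \
    --nnodes $SLURM_NNODES \
    --rdzv_backend c10d \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --max_restarts 0"

# srun runs one launcher per node; torch.distributed.run spawns the 8 workers.
srun bash -c "$LAUNCHER finetune.py --config conf/tutorial-gpt2-micro.yaml"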
@stas00 Thanks! Torch's launcher is working perfectly fine! I can now run multi-GPU, multi-node code with DeepSpeed optimization (feeling powerful!). Thanks @mrwyattii @jeffra