DeepSpeed launch jobs with Kubernetes
I’m hoping to launch DeepSpeed-based multi-node training jobs with Kubernetes. To that end, I’m writing YAML files for my jobs and building a custom Docker image with all the third-party dependencies I need (including DeepSpeed).
I see in the resource-configuration-multi-node section of the docs that DeepSpeed typically requires a hostfile for multi-node training, but maintaining one by hand kinda defeats the purpose of using Kubernetes, where pods are scheduled dynamically.
What is a good way to enable DeepSpeed-based multi-node training with Kubernetes?
I see in the mpi-compatibility section that DeepSpeed is compatible with mpirun.
I quote the docs:
DeepSpeed will then use mpi4py to discover the MPI environment (e.g., rank, world size) and properly initialize torch distributed for training. In this case you will explicitly invoke python to launch your model script instead of using the deepspeed launcher, here is an example:
mpirun <mpi-args> python \
    <client_entry.py> <client args> \
    --deepspeed_mpi --deepspeed --deepspeed_config ds_config.json
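For anyone unsure what that discovery step amounts to, here is a minimal sketch of the idea. Note that this is not DeepSpeed's code: per the docs quoted above, DeepSpeed goes through mpi4py, whereas this sketch simply reads the environment variables that Open MPI exports to each process it launches and maps them to the names torch distributed's env:// initialization expects. The helper name `torch_env_from_openmpi`, the host name `worker-0`, and the default port are made up for illustration.

```python
# Open MPI exports these variables to every process started by mpirun.
OMPI_TO_TORCH = {
    "OMPI_COMM_WORLD_RANK": "RANK",
    "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
    "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
}


def torch_env_from_openmpi(environ, master_addr, master_port=29500):
    """Derive torch.distributed's env:// variables from Open MPI's (hypothetical helper)."""
    missing = [name for name in OMPI_TO_TORCH if name not in environ]
    if missing:
        raise RuntimeError(f"not started by Open MPI's mpirun? missing: {missing}")
    env = {torch_name: environ[ompi_name]
           for ompi_name, torch_name in OMPI_TO_TORCH.items()}
    # Assumption: the caller already knows which host runs rank 0.
    env["MASTER_ADDR"] = master_addr
    env["MASTER_PORT"] = str(master_port)
    return env


# Simulated environment of one process in an 8-process job.
simulated = {
    "OMPI_COMM_WORLD_RANK": "3",
    "OMPI_COMM_WORLD_SIZE": "8",
    "OMPI_COMM_WORLD_LOCAL_RANK": "1",
}
print(torch_env_from_openmpi(simulated, master_addr="worker-0"))
```

In a real training script you would pass `os.environ` instead of the simulated dict and merge the result back into `os.environ` before initializing torch distributed. With `--deepspeed_mpi`, DeepSpeed is supposed to take care of this step for you.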
Is mpirun the solution for training with Kubernetes? What kind of args need to be passed as <mpi-args> above?
Do you have any sample multi-node training job YAML files and Dockerfiles that you could share? Could you summarize a checklist of things to do to the training code, environment variables to set, etc. (I may be missing something here) to make it ready for Kubernetes-based multi-node training?
Thanks!
Issue Analytics
- Created 3 years ago
- Comments:16 (7 by maintainers)
Top GitHub Comments
FYI @jeffra - we’ve managed to use DeepSpeed on Kubernetes with OpenMPI and the MPI Operator using DeepSpeed’s Dockerfile with added OpenMPI and SSH configs: https://gist.github.com/yochze/e6954524da6bc0a080c43015dead4903
No hostfile or other special configuration was needed.
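As a rough illustration of what `<mpi-args>` could look like with Open MPI in a setup like this, here is a small sketch that assembles the command line. It is not taken from the gist above. The helper `build_mpirun_command`, the script name `train.py`, and the exported variable names are placeholders. The flags `-np`, `--hostfile`, and `-x` are standard Open MPI `mpirun` options, and the trailing DeepSpeed flags are the ones from the docs example. As far as I understand, the MPI Operator hands the launcher a generated hostfile, so `--hostfile` may be unnecessary there, which is why it is optional here.

```python
import shlex


def build_mpirun_command(num_procs, script, script_args=(), hostfile=None, exported_env=()):
    """Assemble an Open MPI mpirun command for a DeepSpeed script (hypothetical helper)."""
    cmd = ["mpirun", "-np", str(num_procs)]  # total number of processes, usually one per GPU
    if hostfile is not None:
        cmd += ["--hostfile", hostfile]      # only if the environment does not provide one
    for name in exported_env:
        cmd += ["-x", name]                  # forward this environment variable to all ranks
    cmd += ["python", script, *script_args]
    # DeepSpeed flags as given in the docs example.
    cmd += ["--deepspeed_mpi", "--deepspeed", "--deepspeed_config", "ds_config.json"]
    return cmd


# Placeholder values: 2 nodes x 4 GPUs, forwarding two common variables.
command = build_mpirun_command(8, "train.py", exported_env=["NCCL_DEBUG", "PATH"])
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps quoting safe if you later hand it to `subprocess.run`. Which extra MCA or binding options you need depends on your cluster's network, so treat this as a starting point only.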
@jeffra thanks for your reply! I managed to get this working with mpirun and deepspeed!
On a somewhat unrelated note, could you please elaborate on the need for each of the highlighted dependencies in deepspeed’s Dockerfile?