DeepSpeed launch jobs with Kubernetes

I’m hoping to launch DeepSpeed-based multi-node training jobs with Kubernetes, and for this purpose, I’m currently working on writing YAML files for my jobs and building a custom Docker image with all needed third-party dependencies (including DeepSpeed).

I see in the resource-configuration-multi-node section of the docs that DeepSpeed typically requires a hostfile for multi-node training, but that kinda defeats the purpose of using Kubernetes.

What is a good way to enable DeepSpeed-based multi-node training with Kubernetes?

I see in the mpi-compatibility section that DeepSpeed is compatible with mpirun.

I quote the docs:

DeepSpeed will then use mpi4py to discover the MPI environment (e.g., rank, world size) and properly initialize torch distributed for training. In this case you will explicitly invoke python to launch your model script instead of using the deepspeed launcher, here is an example:

    mpirun <mpi-args> python \
        <client_entry.py> <client args> \
        --deepspeed_mpi --deepspeed --deepspeed_config ds_config.json
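To make the rank and world-size discovery described in that quote more concrete, here is a minimal sketch. It is not DeepSpeed's implementation, which per the docs goes through mpi4py. Instead it assumes Open MPI and reads the `OMPI_COMM_WORLD_*` variables that Open MPI's `mpirun` exports to each process, then maps them onto the variables that torch.distributed's `env://` initialization expects. The function name and the `master_addr` and `master_port` values are hypothetical.

```python
# Sketch only: assumes processes were started by Open MPI's mpirun.
# This is a stand-in for the mpi4py-based discovery that DeepSpeed performs.

# Open MPI variable -> variable read by torch.distributed (env:// init)
_OMPI_TO_TORCH = {
    "OMPI_COMM_WORLD_RANK": "RANK",
    "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
    "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
}


def torch_env_from_open_mpi(environ, master_addr, master_port=29500):
    """Return the env vars torch.distributed needs, derived from Open MPI's.

    `environ` is any mapping such as os.environ. `master_addr` is the
    hostname of the rank-0 process (hypothetical; you must supply it).
    """
    missing = [name for name in _OMPI_TO_TORCH if name not in environ]
    if missing:
        raise RuntimeError(f"not launched by Open MPI? missing: {missing}")

    env = {torch: environ[ompi] for ompi, torch in _OMPI_TO_TORCH.items()}
    env["MASTER_ADDR"] = master_addr
    env["MASTER_PORT"] = str(master_port)
    return env
```

In a training script, one could call this with `os.environ`, update `os.environ` with the result, and then initialize the process group from the environment.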

Is mpirun the solution for training with Kubernetes? What kind of args need to be passed as <mpi-args> above?

Do you have any sample multi-node training job YAML files and Dockerfiles that you could share? Could you summarize a checklist of things to do to the training code, environment variables to set, etc. (I may be missing something here) to make it ready for Kubernetes-based multi-node training?

Thanks!

Issue Analytics

Created 3 years ago
Comments: 16 (7 by maintainers)

Top GitHub Comments

5 reactions

yochze commented, Jul 27, 2020

FYI @jeffra - we’ve managed to use DeepSpeed on Kubernetes with OpenMPI and the MPI Operator using DeepSpeed’s Dockerfile with added OpenMPI and SSH configs: https://gist.github.com/yochze/e6954524da6bc0a080c43015dead4903

No need for a hostfile or any special configuration.
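For readers wondering what `<mpi-args>` might look like in a containerized setup like this, here is a rough sketch that assembles an Open MPI command line around the DeepSpeed flags quoted in the question. It is an illustration, not the contents of the gist above. The helper name, the script name `train.py`, and the process count are hypothetical. `-np`, `-x`, and `--allow-run-as-root` are Open MPI options, and whether you need each of them depends on your image and cluster.

```python
def build_mpirun_command(num_procs, script, script_args=(),
                         forward_env=("NCCL_DEBUG",),
                         ds_config="ds_config.json"):
    """Build an argv list for launching a DeepSpeed script via Open MPI.

    Assumptions (not taken from the thread): containers run as root, and
    the listed environment variables should be forwarded to every rank.
    """
    cmd = ["mpirun", "-np", str(num_procs), "--allow-run-as-root"]

    # -x NAME forwards an environment variable to all launched processes.
    for name in forward_env:
        cmd += ["-x", name]

    # Invoke python directly instead of the deepspeed launcher, as the
    # DeepSpeed docs describe for MPI launches.
    cmd += ["python", script, *script_args,
            "--deepspeed_mpi", "--deepspeed",
            "--deepspeed_config", ds_config]
    return cmd
```

For example, `build_mpirun_command(4, "train.py")` yields a list that can be passed to `subprocess.run` or joined into the launcher container's command.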

1 reaction

g-karthik commented, Jun 30, 2020

@jeffra thanks for your reply! I managed to get this working with mpirun and deepspeed!

On a somewhat unrelated note, could you please elaborate on the need for each of the highlighted dependencies in deepspeed’s Dockerfile?