
DeepSpeed launch jobs with Kubernetes


I’m hoping to launch DeepSpeed-based multi-node training jobs with Kubernetes. To that end, I’m writing YAML files for my jobs and building a custom Docker image with all the needed third-party dependencies (including DeepSpeed).

I see in the resource-configuration-multi-node section of the docs that DeepSpeed typically requires a hostfile for multi-node training, but that kinda defeats the purpose of using Kubernetes.

What is a good way to enable DeepSpeed-based multi-node training with Kubernetes?

I see in the mpi-compatibility section that DeepSpeed is compatible with mpirun.

I quote the docs:

DeepSpeed will then use mpi4py to discover the MPI environment (e.g., rank, world size) and properly initialize torch distributed for training. In this case you will explicitly invoke python to launch your model script instead of using the deepspeed launcher, here is an example:

    mpirun <mpi-args> python \
        <client_entry.py> <client args> \
        --deepspeed_mpi --deepspeed --deepspeed_config ds_config.json
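For my own understanding, the mpi4py discovery described above seems to boil down to something like the following (a conceptual sketch only; the real logic lives inside DeepSpeed, and the master address/port handling here is my assumption):

    # Conceptual sketch of MPI-based discovery (illustrative, not DeepSpeed's actual code).
    import os
    import socket
    from mpi4py import MPI
    import torch.distributed as dist

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    world_size = comm.Get_size()

    # Rank 0 shares its address so every process agrees on the rendezvous point.
    master_addr = comm.bcast(socket.gethostname() if rank == 0 else None, root=0)

    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = "29500"   # assumed default; would be configurable
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)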

Is mpirun the solution for training with Kubernetes? What kind of args need to be passed as <mpi-args> above?

Do you have any sample multi-node training job YAML files and Dockerfiles that you could share? Could you also summarize a checklist of changes to the training code, environment variables to set, etc. (I may be missing something here) to make it ready for Kubernetes-based multi-node training?

Thanks!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (7 by maintainers)

Top GitHub Comments

5 reactions
yochze commented, Jul 27, 2020

FYI @jeffra - we’ve managed to use DeepSpeed on Kubernetes with OpenMPI and the MPI Operator using DeepSpeed’s Dockerfile with added OpenMPI and SSH configs: https://gist.github.com/yochze/e6954524da6bc0a080c43015dead4903

No need for a hostfile or any special configuration.
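For reference, with that setup mpirun just starts python on every pod, so the training script itself only needs the standard DeepSpeed initialization. A minimal sketch (not taken from the gist above; the model, data, and training loop are placeholders, and it assumes ds_config.json defines an optimizer and batch size):

    # train.py - minimal DeepSpeed entry point launched via mpirun / the MPI Operator.
    import argparse
    import torch
    import deepspeed

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed, --deepspeed_config, ...
    args = parser.parse_args()

    # Let DeepSpeed set up torch.distributed from the launcher/MPI environment.
    deepspeed.init_distributed()

    model = torch.nn.Linear(10, 1)  # placeholder model
    engine, optimizer, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters(),
    )

    for _ in range(10):  # placeholder training loop
        x = torch.randn(8, 10, device=engine.device)
        loss = engine(x).pow(2).mean()
        engine.backward(loss)
        engine.step()

The mpirun side then matches the command quoted from the docs above, with the MPI Operator taking care of the host list.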

1 reaction
g-karthik commented, Jun 30, 2020

@jeffra thanks for your reply! I managed to get this working with mpirun and deepspeed!

On a somewhat unrelated note, could you please elaborate on the need for each of the highlighted dependencies in deepspeed’s Dockerfile?
