
multiple deepspeed runs in a single machine

See original GitHub issue

Hi.

I have an 8-GPU local machine and am trying to use DeepSpeed to run two separate experiments, with 4 GPUs each. I also assigned a different master port to each experiment, like so:

Run 1: deepspeed --include=localhost:0,1,2,3 --master_port 61000 train.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json --dataset ...

Run 2: deepspeed --include=localhost:4,5,6,7 --master_port 60000 train.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json --dataset ...

However, one of the runs fails to start with an error like RuntimeError: Address already in use. I checked that the ports were available and tried different port numbers as well, but I still couldn’t make it work.
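One way to rule out port collisions before launching is to try binding the candidate port first. The sketch below is illustrative and not from the issue; the helper name `port_is_free` is made up, and it assumes `python3` is on PATH:

```shell
# Hypothetical helper: succeeds (exit 0) only if the given TCP port
# can currently be bound, i.e. nothing else is listening on it.
port_is_free() {
  python3 - "$1" <<'EOF'
import socket, sys
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # bind() raises OSError (EADDRINUSE) if the port is taken
    s.bind(("", int(sys.argv[1])))
except OSError:
    sys.exit(1)
finally:
    s.close()
EOF
}

if port_is_free 61000; then
  echo "port 61000 looks free"
else
  echo "port 61000 is already in use"
fi
```

Note that a port can look free at check time and still be grabbed by another process before DeepSpeed binds it, so this only narrows down the cause rather than proving it.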

Do you have any ideas about this issue?

(FYI, I’m using a Docker image downloaded via docker pull deepspeed/deepspeed.)

Thanks!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
afiaka87 commented, Mar 14, 2022

@tjruwase

That was fast! I had actually already resolved the issue by specifying the needed --master_port argument. Sorry for the bother. It seems I had placed the argument after the --include arg, which didn’t work (and caused a different error than the one listed here).
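The ordering point above matters because the deepspeed launcher treats everything after the script name as arguments for the script itself, not for the launcher. A sketch of the working invocation, based on the commands earlier in the thread (the config path and flags are the poster’s, not verified here):

```shell
# Launcher flags (--include, --master_port) go BEFORE train.py;
# anything after train.py is forwarded to the training script.
deepspeed --include=localhost:0,1,2,3 --master_port 61000 \
  train.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json
```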

2 reactions
tjruwase commented, Mar 14, 2022

@afiaka87, @IndexFziQ

Please see here for instructions on creating a hostfile. Also, it might be best to open a new ticket: this issue was closed because the original problem appeared to have been solved, and the code base and docs have changed significantly since then. Thanks!
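For context, a DeepSpeed hostfile is a plain-text file listing hosts and their GPU slot counts, one per line. A minimal sketch (the hostnames below are made up for illustration):

```
worker-1 slots=8
worker-2 slots=8
```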

Read more comments on GitHub >

Top Results From Across the Web

Getting Started - DeepSpeed
First steps with DeepSpeed.

Training On Multiple Nodes With DeepSpeed
This tutorial will assume you want to train on multiple nodes. One essential configuration for DeepSpeed is the hostfile, which contains lists of...

DeepSpeed Integration - Hugging Face
DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which won't be...

A Gentle Introduction to Distributed Training with DeepSpeed
With minimal code changes, a developer can train a model on a single GPU machine, a single machine with multiple GPUs, or on...

Distributed GPU Training | Azure Machine Learning
To run distributed training with the DeepSpeed library on Azure ML, do not use ... If you are using the launch utility to...
