
multiple deepspeed runs in a single machine

See original GitHub issue

Hi.

I have an 8-GPU local machine and am trying to use DeepSpeed to run two separate experiments, with 4 GPUs each. I also assigned a different master port to each experiment, like so:

Run 1: deepspeed --include=localhost:0,1,2,3 --master_port 61000 train.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json --dataset ...

Run 2: deepspeed --include=localhost:4,5,6,7 --master_port 60000 train.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json --dataset ...

However, one of the runs fails to start with an error like RuntimeError: Address already in use. I checked that the ports were available and tried different port numbers as well, but I still couldn’t make it work.
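One way to rule out port collisions before launching is to try binding the candidate port first. The sketch below is illustrative and not from the issue; the helper name `port_is_free` is made up, and it assumes `python3` is on PATH:

```shell
# Hypothetical helper: succeeds (exit 0) only if the given TCP port
# can currently be bound, i.e. nothing else is listening on it.
port_is_free() {
  python3 - "$1" <<'EOF'
import socket, sys
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # bind() raises OSError (EADDRINUSE) if the port is taken
    s.bind(("", int(sys.argv[1])))
except OSError:
    sys.exit(1)
finally:
    s.close()
EOF
}

if port_is_free 61000; then
  echo "port 61000 looks free"
else
  echo "port 61000 is already in use"
fi
```

Note that a port can look free at check time and still be grabbed by another process before DeepSpeed binds it, so this only narrows down the cause rather than proving it.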

Do you have any ideas about this issue?

(FYI, I’m using a Docker image downloaded via docker pull deepspeed/deepspeed.)

Thanks!

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

2 reactions
afiaka87 commented, Mar 14, 2022

@tjruwase

That was fast! I had actually already resolved the issue by specifying the needed --master_port argument. Sorry for the bother. It seems I had placed the argument after the --include arg, which didn’t work (and caused a different error than the one listed here).
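The ordering point above matters because the deepspeed launcher treats everything after the script name as arguments for the script itself, not for the launcher. A sketch of the working invocation, based on the commands earlier in the thread (the config path and flags are the poster’s, not verified here):

```shell
# Launcher flags (--include, --master_port) go BEFORE train.py;
# anything after train.py is forwarded to the training script.
deepspeed --include=localhost:0,1,2,3 --master_port 61000 \
  train.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json
```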

2 reactions
tjruwase commented, Mar 14, 2022

@afiaka87, @IndexFziQ

Please see here for instructions on creating a hostfile. Also, it might be best to open a new ticket: this issue was closed because the original problem appeared to have been solved, and the code base and docs have changed significantly since then. Thanks!
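For context, a DeepSpeed hostfile is a plain-text file listing hosts and their GPU slot counts, one per line. A minimal sketch (the hostnames below are made up for illustration):

```
worker-1 slots=8
worker-2 slots=8
```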

Read more comments on GitHub >

Top Results From Across the Web

Getting Started - DeepSpeed
First steps with DeepSpeed.

Training On Multiple Nodes With DeepSpeed
This tutorial will assume you want to train on multiple nodes. One essential configuration for DeepSpeed is the hostfile, which contains lists of...

DeepSpeed Integration - Hugging Face
DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded on multiple GPUs, which won't be...

A Gentle Introduction to Distributed Training with DeepSpeed
With minimal code changes, a developer can train a model on a single GPU machine, a single machine with multiple GPUs, or on...

Distributed GPU Training | Azure Machine Learning
To run distributed training with the DeepSpeed library on Azure ML, do not use ... If you are using the launch utility to...
