multiple deepspeed runs in a single machine
Hi.
I have an 8-GPU local machine and am trying to use deepspeed to run 2 separate experiments on it, with 4 GPUs each. I also assigned a different master port to each experiment, like this:
run 1
deepspeed --include=localhost:0,1,2,3 --master_port 61000 train.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json --dataset ...
run 2
deepspeed --include=localhost:4,5,6,7 --master_port 60000 train.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json --dataset ...
However, one of the runs fails to start with an error like RuntimeError: Address already in use. I checked that the ports are available and tried different port numbers as well, but I still couldn’t make it work.
Do you have any ideas about this issue?
(FYI, I’m using a docker image that was downloaded via docker pull deepspeed/deepspeed.)
Thanks!
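For anyone wanting to repeat the port check mentioned above, here is a minimal sketch using only the Python standard library. The helper name port_is_free is hypothetical and not part of deepspeed. It only tests whether a single TCP port can be bound on localhost at that moment, so a "free" result does not guarantee the launcher will not hit Address already in use on another port or a moment later.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper for illustration; not part of deepspeed.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE when another process holds the port.
            return False
        return True


if __name__ == "__main__":
    # The two master ports used in the runs above.
    for port in (60000, 61000):
        print(port, "free" if port_is_free(port) else "in use")
```

Running this inside the same container as the experiments matters, since a port check on the host may not reflect what is bound inside docker.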
Top GitHub Comments
@tjruwase
That was fast! I actually resolved the issue already by specifying the needed --master_port argument. Sorry for the bother. It seems I had the argument placed after the --include arg, which didn’t work (causing a different error than the one listed here).
@afiaka87, @IndexFziQ
Please see here for instructions on creating a hostfile. Also, it might be better to open a new ticket. The reason is that this issue was closed because the original problem appeared to have been solved, and the code base and docs have changed significantly since then. Thanks!
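The flag ordering described in the first comment can be sketched as a small command builder. This is only an illustration under assumptions: build_deepspeed_cmd is a hypothetical helper, and placing --master_port before --include follows that commenter's report rather than anything verified against the launcher. The script arguments are the ones shown in the question, and the elided --dataset value is left out.

```python
import shlex


def build_deepspeed_cmd(gpu_ids, master_port, script, script_args=()):
    """Build a deepspeed launch command as an argument list.

    Hypothetical helper. Launcher flags go before the training script,
    with --master_port ahead of --include, as reported in the comment above.
    """
    launcher = [
        "deepspeed",
        "--master_port", str(master_port),
        "--include=localhost:" + ",".join(str(g) for g in gpu_ids),
    ]
    return launcher + [script, *script_args]


# Arguments taken from the question; both runs share them.
common_args = ["--deepspeed", "--deepspeed_config", "deepspeed_util/ds_config.json"]

# Each run gets its own GPU subset and its own master port.
run1 = build_deepspeed_cmd([0, 1, 2, 3], 61000, "train.py", common_args)
run2 = build_deepspeed_cmd([4, 5, 6, 7], 60000, "train.py", common_args)

print(shlex.join(run1))
print(shlex.join(run2))
```

The printed lines can be pasted into two shells, or each list can be passed to subprocess.Popen to start both runs from one script.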