Launcher not registering the user_script as argument.
Hello, I’m trying to run a basic multi-node DeepSpeed setup on a pod.
When I run deepspeed --hostfile=myhostfile basic_deepspeed.py, I’m getting:
[2022-12-15 21:00:19,543] [INFO] [runner.py:417:main] Using IP address of for node ddp-0.ddp.ml-dev.svc.cluster.local
[2022-12-15 21:00:19,544] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: ddp-0.ddp.ml-dev.svc.cluster.local,ddp-1.ddp.ml-dev.svc.cluster.local
[2022-12-15 21:00:19,545] [INFO] [runner.py:508:main] cmd = pdsh -S -f 1024 -w ddp-0.ddp.ml-dev.svc.cluster.local,ddp-1.ddp.ml-dev.svc.cluster.local export PYTHON_VERSION=3.9.13; export PYTHON_SETUPTOOLS_VERSION=58.1.0; export PYTHON_PIP_VERSION=22.0.4; export PYTHON_GET_PIP_SHA256=5aefe6ade911d997af080b315ebcb7f882212d070465df544e1175ac2be519b4; export PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/5eaac1050023df1f5c98b173b248c260023f2278/public/get-pip.py; export PYTHONPATH=/; cd /; /usr/local/bin/python -u -m deepspeed.launcher.launch --world_info=eyJkZHAtMC5kZHAubWwtZGV2LnN2Yy5jbHVzdGVyLmxvY2FsIjogWzAsIDFdLCAiZGRwLTEuZGRwLm1sLWRldi5zdmMuY2x1c3Rlci5sb2NhbCI6IFswLCAxXX0= --node_rank=%n --master_addr= --master_port=29500 scripts/basic_deepspeed.py
ddp-0:
ddp-0: _ _ _ _ _ _ _
ddp-0: /\_\ /\ \ /\ \ _ / /\ / /\ / /\ /\ \
ddp-0: / / / _ / \ \ / \ \ /\_\ / / \ / / / / / // \ \
ddp-0: / / / /\_\ / /\ \ \ / /\ \ \_/ / // / /\ \__ / /_/ / / // /\ \ \
ddp-0: / / /__/ / / / / /\ \_\ / / /\ \___/ // / /\ \___\ / /\ \__/ / // / /\ \ \
ddp-0: / /\_____/ / / /_/_ \/_/ / / / \/____/ \ \ \ \/___// /\ \___\/ // / / \ \_\
ddp-0: / /\_______/ / /____/\ / / / / / / \ \ \ / / /\/___/ // / / / / /
ddp-0: / / /\ \ \ / /\____\/ / / / / / /_ \ \ \ / / / / / // / / / / /
ddp-0: / / / \ \ \ / / /______ / / / / / //_/\__/ / / / / / / / // / /___/ / /
ddp-0: / / / \ \ \ / / /_______\/ / / / / / \ \/___/ / / / / / / // / /____\/ /
ddp-0: \/_/ \_\_\\/__________/\/_/ \/_/ \_____\/ \/_/ \/_/ \/_________/
ddp-0:
ddp-0:
ddp-0:
ddp-1:
ddp-1: _ _ _ _ _ _ _
ddp-1: /\_\ /\ \ /\ \ _ / /\ / /\ / /\ /\ \
ddp-1: / / / _ / \ \ / \ \ /\_\ / / \ / / / / / // \ \
ddp-1: / / / /\_\ / /\ \ \ / /\ \ \_/ / // / /\ \__ / /_/ / / // /\ \ \
ddp-1: / / /__/ / / / / /\ \_\ / / /\ \___/ // / /\ \___\ / /\ \__/ / // / /\ \ \
ddp-1: / /\_____/ / / /_/_ \/_/ / / / \/____/ \ \ \ \/___// /\ \___\/ // / / \ \_\
ddp-1: / /\_______/ / /____/\ / / / / / / \ \ \ / / /\/___/ // / / / / /
ddp-1: / / /\ \ \ / /\____\/ / / / / / /_ \ \ \ / / / / / // / / / / /
ddp-1: / / / \ \ \ / / /______ / / / / / //_/\__/ / / / / / / / // / /___/ / /
ddp-1: / / / \ \ \ / / /_______\/ / / / / / \ \/___/ / / / / / / // / /____\/ /
ddp-1: \/_/ \_\_\\/__________/\/_/ \/_/ \_____\/ \/_/ \/_/ \/_________/
ddp-1:
ddp-1:
ddp-1:
ddp-0: usage: launch.py [-h] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
ddp-0: [--master_port MASTER_PORT] [--world_info WORLD_INFO]
ddp-0: [--module] [--no_python] [--enable_elastic_training]
ddp-0: [--min_elastic_nodes MIN_ELASTIC_NODES]
ddp-0: [--max_elastic_nodes MAX_ELASTIC_NODES] [--no_local_rank]
ddp-0: [--save_pid SAVE_PID]
ddp-0: [--enable_each_rank_log ENABLE_EACH_RANK_LOG]
ddp-0: training_script ...
ddp-0: launch.py: error: the following arguments are required: training_script, training_script_args
ddp-1: usage: launch.py [-h] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
ddp-1: [--master_port MASTER_PORT] [--world_info WORLD_INFO]
ddp-1: [--module] [--no_python] [--enable_elastic_training]
ddp-1: [--min_elastic_nodes MIN_ELASTIC_NODES]
ddp-1: [--max_elastic_nodes MAX_ELASTIC_NODES] [--no_local_rank]
ddp-1: [--save_pid SAVE_PID]
ddp-1: [--enable_each_rank_log ENABLE_EACH_RANK_LOG]
ddp-1: training_script ...
ddp-1: launch.py: error: the following arguments are required: training_script, training_script_args
ddp-0: bash: line 1: 31m: command not found
pdsh@ddp-0: ddp-0: ssh exited with exit code 127
ddp-1: bash: line 1: 31m: command not found
pdsh@ddp-0: ddp-1: ssh exited with exit code 127
It successfully connects to both pods (named ddp-0 and ddp-1) using ssh, but for some reason runner.py doesn’t pass my script successfully to launch.py. Any ideas why?
I’m running on Debian 11 with torch 1.13.0 and deepspeed 0.7.7.
Issue Analytics
- State:
- Created 9 months ago
- Comments:10 (5 by maintainers)
Top GitHub Comments
That does not work, unfortunately. On second thought, this banner might not be related to ssh; it’s the Linux login banner. I like the regex idea, though.
Ooh haha, that makes a lot more sense of what’s going on now. The way we are parsing the output is super basic here. We could probably change this to use a regex that matches the first IP address found. I’ll change #2631 to do this.
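For illustration, the regex idea could look roughly like the sketch below. This is not the code in runner.py or in #2631. It assumes the launcher takes the first whitespace-separated token of whatever the remote host prints, and that the login banner starts with an ANSI color code. The function name first_ipv4, the sample banner text, and the addresses are all made up for the example. If that assumption holds, the ";" inside the color code would also explain the "31m: command not found" line in the log above, since a shell treats ";" as a command separator and everything after it, including the script path, ends up in a second command.

```python
import re

# One IPv4 octet, limited to 0-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# A dotted quad that is not embedded in a longer run of digits or dots.
IPV4_RE = re.compile(rf"(?<![\d.]){_OCTET}(?:\.{_OCTET}){{3}}(?![\d.])")


def first_ipv4(output):
    """Return the first IPv4 address found in output, or None if there is none."""
    match = IPV4_RE.search(output)
    return match.group(0) if match else None


# Hypothetical captured output: a colored login banner, then the addresses.
sample = "\x1b[1;31m  /\\_\\ banner art \x1b[0m\n10.42.0.17 172.17.0.1\n"

# Naive parsing (assumed behavior): take the first whitespace-separated token.
naive = sample.split()[0]

# Regex-based parsing skips the banner and finds the address.
robust = first_ipv4(sample)

print(repr(naive))       # the color code, not an address
print(naive.split(";"))  # the ';' is where a shell would cut the command
print(robust)
```

With this sample, the naive token is the escape sequence rather than an address, while the regex returns the first dotted quad regardless of what the banner prints before it.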