Launcher not registering the user_script as argument.

Hello, I’m trying to run a basic multi-node DeepSpeed setup on a pod.

When I run deepspeed --hostfile=myhostfile basic_deepspeed.py, I get the following output:

[2022-12-15 21:00:19,543] [INFO] [runner.py:417:main] Using IP address of  for node ddp-0.ddp.ml-dev.svc.cluster.local
[2022-12-15 21:00:19,544] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: ddp-0.ddp.ml-dev.svc.cluster.local,ddp-1.ddp.ml-dev.svc.cluster.local
[2022-12-15 21:00:19,545] [INFO] [runner.py:508:main] cmd = pdsh -S -f 1024 -w ddp-0.ddp.ml-dev.svc.cluster.local,ddp-1.ddp.ml-dev.svc.cluster.local export PYTHON_VERSION=3.9.13; export PYTHON_SETUPTOOLS_VERSION=58.1.0; export PYTHON_PIP_VERSION=22.0.4; export PYTHON_GET_PIP_SHA256=5aefe6ade911d997af080b315ebcb7f882212d070465df544e1175ac2be519b4; export PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/5eaac1050023df1f5c98b173b248c260023f2278/public/get-pip.py; export PYTHONPATH=/;  cd /; /usr/local/bin/python -u -m deepspeed.launcher.launch --world_info=eyJkZHAtMC5kZHAubWwtZGV2LnN2Yy5jbHVzdGVyLmxvY2FsIjogWzAsIDFdLCAiZGRwLTEuZGRwLm1sLWRldi5zdmMuY2x1c3Rlci5sb2NhbCI6IFswLCAxXX0= --node_rank=%n --master_addr= --master_port=29500 scripts/basic_deepspeed.py
ddp-0: 
ddp-0:          _              _            _             _            _       _    _
ddp-0:         /\_\           /\ \         /\ \     _    / /\         / /\    / /\ /\ \
ddp-0:        / / /  _       /  \ \       /  \ \   /\_\ / /  \       / / /   / / //  \ \
ddp-0:       / / /  /\_\    / /\ \ \     / /\ \ \_/ / // / /\ \__   / /_/   / / // /\ \ \
ddp-0:      / / /__/ / /   / / /\ \_\   / / /\ \___/ // / /\ \___\ / /\ \__/ / // / /\ \ \
ddp-0:     / /\_____/ /   / /_/_ \/_/  / / /  \/____/ \ \ \ \/___// /\ \___\/ // / /  \ \_\
ddp-0:    / /\_______/   / /____/\    / / /    / / /   \ \ \     / / /\/___/ // / /   / / /
ddp-0:   / / /\ \ \     / /\____\/   / / /    / / /_    \ \ \   / / /   / / // / /   / / /
ddp-0:  / / /  \ \ \   / / /______  / / /    / / //_/\__/ / /  / / /   / / // / /___/ / /
ddp-0: / / /    \ \ \ / / /_______\/ / /    / / / \ \/___/ /  / / /   / / // / /____\/ /
ddp-0: \/_/      \_\_\\/__________/\/_/     \/_/   \_____\/   \/_/    \/_/ \/_________/
ddp-0: 
ddp-0: 
ddp-0: 
ddp-1: 
ddp-1:          _              _            _             _            _       _    _
ddp-1:         /\_\           /\ \         /\ \     _    / /\         / /\    / /\ /\ \
ddp-1:        / / /  _       /  \ \       /  \ \   /\_\ / /  \       / / /   / / //  \ \
ddp-1:       / / /  /\_\    / /\ \ \     / /\ \ \_/ / // / /\ \__   / /_/   / / // /\ \ \
ddp-1:      / / /__/ / /   / / /\ \_\   / / /\ \___/ // / /\ \___\ / /\ \__/ / // / /\ \ \
ddp-1:     / /\_____/ /   / /_/_ \/_/  / / /  \/____/ \ \ \ \/___// /\ \___\/ // / /  \ \_\
ddp-1:    / /\_______/   / /____/\    / / /    / / /   \ \ \     / / /\/___/ // / /   / / /
ddp-1:   / / /\ \ \     / /\____\/   / / /    / / /_    \ \ \   / / /   / / // / /   / / /
ddp-1:  / / /  \ \ \   / / /______  / / /    / / //_/\__/ / /  / / /   / / // / /___/ / /
ddp-1: / / /    \ \ \ / / /_______\/ / /    / / / \ \/___/ /  / / /   / / // / /____\/ /
ddp-1: \/_/      \_\_\\/__________/\/_/     \/_/   \_____\/   \/_/    \/_/ \/_________/
ddp-1: 
ddp-1: 
ddp-1: 
ddp-0: usage: launch.py [-h] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
ddp-0:                  [--master_port MASTER_PORT] [--world_info WORLD_INFO]
ddp-0:                  [--module] [--no_python] [--enable_elastic_training]
ddp-0:                  [--min_elastic_nodes MIN_ELASTIC_NODES]
ddp-0:                  [--max_elastic_nodes MAX_ELASTIC_NODES] [--no_local_rank]
ddp-0:                  [--save_pid SAVE_PID]
ddp-0:                  [--enable_each_rank_log ENABLE_EACH_RANK_LOG]
ddp-0:                  training_script ...
ddp-0: launch.py: error: the following arguments are required: training_script, training_script_args
ddp-1: usage: launch.py [-h] [--node_rank NODE_RANK] [--master_addr MASTER_ADDR]
ddp-1:                  [--master_port MASTER_PORT] [--world_info WORLD_INFO]
ddp-1:                  [--module] [--no_python] [--enable_elastic_training]
ddp-1:                  [--min_elastic_nodes MIN_ELASTIC_NODES]
ddp-1:                  [--max_elastic_nodes MAX_ELASTIC_NODES] [--no_local_rank]
ddp-1:                  [--save_pid SAVE_PID]
ddp-1:                  [--enable_each_rank_log ENABLE_EACH_RANK_LOG]
ddp-1:                  training_script ...
ddp-1: launch.py: error: the following arguments are required: training_script, training_script_args
ddp-0: bash: line 1: 31m: command not found
pdsh@ddp-0: ddp-0: ssh exited with exit code 127
ddp-1: bash: line 1: 31m: command not found
pdsh@ddp-0: ddp-1: ssh exited with exit code 127

It successfully connects to both pods (named ddp-0 and ddp-1) using ssh, but for some reason runner.py doesn’t pass my script successfully to launch.py. Any ideas why?

I’m running on Debian 11 with torch 1.13.0 and deepspeed 0.7.7.

Issue Analytics

  • State: closed
  • Created 9 months ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

1 reaction
dogacancolak commented, Dec 20, 2022

That does not work, unfortunately. On second thought, this banner might not be related to ssh; it’s the Linux login banner. I like the regex idea though.

1 reaction
jeffra commented, Dec 20, 2022

Ooh haha, that makes a lot more sense now as to what’s going on. The way we are parsing the output is super basic here. We could probably change this to use a regex that matches the first IP address found. I’ll change #2631 to do this.
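
To illustrate the regex idea in the comment above, here is a minimal sketch in Python. It is not the change made in #2631 and not DeepSpeed's actual code; the function name first_ipv4 and the sample banner text are hypothetical, made up for this example. The assumption is that the text captured from the remote host may contain a login banner (possibly with ANSI color codes) before the address, so the code searches for the first IPv4-looking token instead of taking the first whitespace-separated word.

```python
import re
from typing import Optional

# Four dot-separated groups of 1-3 digits that are not part of a longer
# dotted number (so "1.2.3.4.5" is not matched).
_IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def first_ipv4(text: str) -> Optional[str]:
    """Return the first IPv4 address found in text, or None.

    Hypothetical helper for illustration only.
    """
    for match in _IPV4_RE.finditer(text):
        candidate = match.group(0)
        # Reject dotted quads with an octet above 255, e.g. "999.1.1.1".
        if all(int(octet) <= 255 for octet in candidate.split(".")):
            return candidate
    return None


# Made-up output: a colored banner line, a blank line, then two addresses.
sample_output = (
    "\x1b[1;31m  / / /  Welcome  / / /\x1b[0m\n"
    "\n"
    "10.42.0.17 172.17.0.1\n"
)

print(first_ipv4(sample_output))  # prints 10.42.0.17
```

Returning None when nothing matches lets the caller fail with a clear error instead of continuing with an empty master address, which is what the empty --master_addr= in the log above suggests happened.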
