
[BUG] OpenMPI backend doesn't support custom mpi launch args

See original GitHub issue

Describe the bug In our environment we need MPI to communicate over a custom SSH port rather than the default 22. With plain Open MPI we can pass -mca plm_rsh_args -p 5000 to specify the port, but this does not work with DeepSpeed.

I looked into the source code and found that the OpenMPI multi-node runner does not use the launcher_args passed via the --launcher_args command-line option.

Expected behavior The deepspeed binary should accept launcher_args and pass them through to the Open MPI launcher.
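
As a rough illustration of this expected behavior, a multi-node runner could tokenize the user-supplied launcher_args string and splice it into the mpirun command it builds. This is only a sketch: the class and attribute names (OpenMPIRunnerSketch, get_cmd, args.launcher_args, args.hostfile, resource_pool) are hypothetical and not taken from the DeepSpeed code base.

import shlex

class OpenMPIRunnerSketch:
    def __init__(self, args, resource_pool):
        self.args = args                     # parsed deepspeed CLI arguments (hypothetical)
        self.resource_pool = resource_pool   # mapping of hostname -> slot count (hypothetical)

    def get_cmd(self, user_script, user_args):
        total_procs = sum(self.resource_pool.values())
        cmd = [
            "mpirun",
            "-n", str(total_procs),
            "-hostfile", self.args.hostfile,
        ]
        # Tokenize launcher_args with shell semantics so that quoted values
        # such as "-p 5000" survive as a single token.
        cmd += shlex.split(self.args.launcher_args or "")
        return cmd + [user_script] + list(user_args)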

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.6/dist-packages/torch']
torch version .................... 1.10.0+cu102
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/usr/local/lib/python3.6/dist-packages/deepspeed']
deepspeed info ................... 0.5.6, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

System info (please complete the following information):

  • OS: Ubuntu 18.04.6 LTS
  • GPU count and types : dynamic number of nodes, V100 or A100
  • Interconnects (if applicable) [not clear]
  • Python version: 3.6.9

Launcher context Are you launching your experiment with the deepspeed launcher, MPI, or something else? The launch command looks like:

deepspeed --master_addr ${master_addr} --master_port 1234 --hostfile ${HOST_FILE} \
       --launcher OpenMPI \
       --launcher_args "--allow-run-as-root -mca plm_rsh_args -p 5000" \
       {user_script} {user_script_args}

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
flyhighzy commented, May 13, 2022

Now it works! Thanks for the update

1 reaction
flyhighzy commented, May 10, 2022

Maybe it’s not that simple. If I pass the launcher args as “-mca plm_rsh_args -p 5000”, the string is split into ["-mca", "plm_rsh_args", "-p", "5000"], but the expected result is ["-mca", "plm_rsh_args", "-p 5000"].
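
The tokenization issue described above can be reproduced in isolation (a minimal standalone sketch, not DeepSpeed code): plain whitespace splitting breaks "-p 5000" into two tokens, whereas shell-style splitting with an inner quoted argument keeps it together.

import shlex

raw = "-mca plm_rsh_args -p 5000"
print(raw.split())
# ['-mca', 'plm_rsh_args', '-p', '5000']   <- what the commenter observed

quoted = '-mca plm_rsh_args "-p 5000"'
print(shlex.split(quoted))
# ['-mca', 'plm_rsh_args', '-p 5000']      <- the expected token list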

Read more comments on GitHub >

Top Results From Across the Web

FAQ: Building Open MPI
How do I build Open MPI with CUDA-aware support? How do I not build a specific plugin / component for Open MPI? What...
Read more >
FAQ: Compiling MPI applications - Open MPI
The Open MPI team strongly recommends that you simply use Open MPI's "wrapper" compilers to compile your MPI applications.
Read more >
FAQ: Troubleshooting building and running MPI jobs - Open MPI
Open MPI tells me that it fails to load components with a "file not found" error — but the file is there! Why...
Read more >
17.8. Open MPI v1.x series
Fix a bus error on MPI_WIN_[POST,START] in the shared memory one-sided component. Add several missing MPI_WIN_FLAVOR constants to the Fortran support.
Read more >
FAQ: Running MPI jobs - Open MPI
I can run ompi_info and launch MPI jobs on a single host, but not across multiple hosts. ... What kind of CUDA support...
Read more >
