
[BUG] Multi-Node Address in Use Error

See original GitHub issue

Describe the bug
I am running DeepSpeed on AWS SageMaker across 4 machines with 4 V100s each (p3.8xlarge instances). I use a custom Docker image in order to run DeepSpeed, since SageMaker does not natively support it.

The command used is:

deepspeed --hostfile=hostfile.txt --master_port=29600 main.py
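For reference, a DeepSpeed hostfile lists one reachable host per line together with its GPU slot count, in the same style as an MPI hostfile. A minimal sketch of what hostfile.txt might look like for this 4-node, 4-GPU setup, assuming SageMaker's usual algo-N hostnames (algo-1 appears in the error log below; the other names are assumptions):

algo-1 slots=4
algo-2 slots=4
algo-3 slots=4
algo-4 slots=4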

What I believe happens is that for $m$ machines with $n$ GPUs each, instead of starting $mn$ processes, deepspeed starts $m^2 n$. Each machine shows deepspeed running 16 processes, but these are all unique processes, resulting in 64 running at the same time.

I verified that it was actually 64 processes, and not just 16, using the logs written on each machine, since the logs differed (only one of the machines did not error). I also printed out time.time() at the same point in the code and obtained different results for what should have been the same process in the logs of the different machines.
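A small diagnostic along these lines makes the duplication visible; this is only a sketch (the helper name is mine, not from the original script), relying on the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that the launcher sets for each worker:

import os
import socket
import time

def log_process_identity(tag=""):
    # Print enough identity to tell whether two similar-looking log lines
    # come from the same process or from duplicate launches on different hosts.
    print(f"[{tag}] host={socket.gethostname()} pid={os.getpid()} "
          f"rank={os.environ.get('RANK')} local_rank={os.environ.get('LOCAL_RANK')} "
          f"world_size={os.environ.get('WORLD_SIZE')} t={time.time():.6f}",
          flush=True)

Calling log_process_identity("setup") at the same point in the training script should show at most one line per (host, local_rank) pair; duplicates with different PIDs or timestamps point to overlapping launches.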

Running the exact same code on just one machine works perfectly fine; the issue only appears when performing distributed training. As a preface, algo-1 is the hostname of the machine that fails.

The exact error log is:

algo-1: Traceback (most recent call last):
algo-1:   File "main.py", line 191, in <module>
algo-1:     main(args)
algo-1:   File "main.py", line 185, in main
algo-1:     trainer = DeepspeedTrainer(model, train_dataset, valid_dataset, args)
algo-1:   File "/cursor-ml/cad_ml/train/train_deepspeed.py", line 28, in __init__
algo-1:     super().__init__(model, train_dataset, valid_dataset, args)
algo-1:   File "/cursor-ml/cad_ml/train/train.py", line 31, in __init__
algo-1:     self.setup()
algo-1:   File "/cursor-ml/cad_ml/train/train_deepspeed.py", line 37, in setup
algo-1:     self.engine, _, __, ___ = deepspeed.initialize(
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
algo-1:     engine = DeepSpeedEngine(args=args,
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 233, in __init__
algo-1:     init_distributed(dist_backend=self.dist_backend)
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 49, in init_distributed
algo-1:     torch.distributed.init_process_group(backend=dist_backend,
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
algo-1:     store, rank, world_size = next(rendezvous_iterator)
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
algo-1:     store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
algo-1:     return TCPStore(
algo-1: RuntimeError: Address already in use

To Reproduce
Steps to reproduce the behavior:

  1. Start a SageMaker job on 4 ml.p3.8xlarge instances
  2. Run deepspeed --hostfile=hostfile.txt --master_port=29600 main.py with a standard training loop
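The last frame of the traceback above is PyTorch's env:// rendezvous creating a TCPStore bound to the master port on the rank-0 host. If more than one launch of the same job tries to act as the store server on the same host and port, the later bind fails. A minimal, DeepSpeed-independent sketch of the underlying condition using plain sockets (the localhost address is illustrative; the port matches --master_port above):

import socket

# First "launch" wins the bind on the rendezvous port...
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 29600))
first.listen()

# ...and a second server on the same host/port cannot bind it.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
second.bind(("127.0.0.1", 29600))  # OSError: [Errno 98] Address already in use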

Expected behavior
I expect the training script to execute properly.

ds_report output
Please run ds_report to give us details about your setup.

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja … [OKAY]

op name … installed … compatible
cpu_adam … [NO] … [OKAY]
cpu_adagrad … [NO] … [OKAY]
fused_adam … [NO] … [OKAY]
fused_lamb … [NO] … [OKAY]
sparse_attn … [NO] … [OKAY]
transformer … [NO] … [OKAY]
stochastic_transformer … [NO] … [OKAY]
async_io … [NO] … [OKAY]
transformer_inference … [NO] … [OKAY]
utils … [NO] … [OKAY]
quantizer … [NO] … [OKAY]

DeepSpeed general environment info:
torch install path … ['/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch']
torch version … 1.10.2+cu102
torch cuda version … 10.2
nvcc version … 10.0
deepspeed install path … ['/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed']
deepspeed info … 0.5.10, unknown, unknown
deepspeed wheel compiled w. … torch 0.0, cuda 0.0


System info (please complete the following information):

  • OS: Ubuntu 18.04
  • 4 machines with 4 V100s each
  • I believe the four machines are connected with 10 GB/s bandwidth
  • Python 3.8.12

Launcher context
I am using the DeepSpeed launcher.

Docker context
I believe it resembles the DeepSpeed Dockerfile, but with a few additional changes.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
Sanger2000 commented on May 16, 2022

Hey @tjruwase, running the deepspeed launcher from a single machine, rather than starting it on all of them at the same time, solved this issue.

Thanks!
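This resolution is consistent with the $m^2 n$ observation above: starting the launcher on every node means each of the $m$ launches independently fans out over the hostfile and spawns $m \cdot n$ workers. A trivial worked check of the numbers (a sketch; the variable names are mine):

nodes, gpus_per_node = 4, 4

# Launcher started on every node: each launch spawns nodes * gpus_per_node workers.
print(nodes * (nodes * gpus_per_node))  # 64 processes, matching what was observed

# Launcher started on a single node: it alone fans out over the hostfile.
print(1 * (nodes * gpus_per_node))      # 16 processes, the expected m*n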

0 reactions
tjruwase commented on May 16, 2022

@Sanger2000, do you have any updates on this issue? Thanks!

