[BUG] Multi-Node Address in Use Error
**Describe the bug**
I am running DeepSpeed on AWS SageMaker across 4 machines with 4 V100s each (p3.8xlarge instances). I use a custom Docker image in order to run DeepSpeed (SageMaker does not natively support it).
The command used is:

```
deepspeed --hostfile=hostfile.txt --master_port=29600 main.py
```
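The hostfile itself is not shown in the report. DeepSpeed's documented hostfile format lists one host per line with a slot count, so for this setup it would presumably look like the following (the `algo-*` hostnames are an assumption, inferred from the SageMaker naming that appears in the error log below):

```
algo-1 slots=4
algo-2 slots=4
algo-3 slots=4
algo-4 slots=4
```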
What I believe happens is that for $m$ machines with $n$ GPUs each, instead of starting $mn$ processes, DeepSpeed starts $m^2 n$. Each machine shows DeepSpeed running 16 processes, but these are all distinct processes, so 64 are running at the same time.
I verified that it was actually 64 processes, and not just 16, from the logs written on each machine, since the logs differed (only one of the machines did not error). I also printed time.time() at the same point in the code and got different values for what should have been the same process across the different machines' logs.
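For illustration (this is a sketch, not the exact code from the report), a duplicate-process check of this kind can rely on the `RANK`/`LOCAL_RANK`/`WORLD_SIZE` environment variables that the DeepSpeed launcher sets for each worker:

```python
import os
import socket
import time

# RANK / LOCAL_RANK / WORLD_SIZE are set by the DeepSpeed launcher for
# each worker process it spawns.
rank = os.environ.get("RANK", "?")
local_rank = os.environ.get("LOCAL_RANK", "?")
world_size = os.environ.get("WORLD_SIZE", "?")

# If the same (rank, local_rank) pair shows up in the logs of more than
# one machine, or with clearly different timestamps, then more than one
# launcher is spawning overlapping process groups.
print(f"host={socket.gethostname()} rank={rank} local_rank={local_rank} "
      f"world_size={world_size} t={time.time():.3f}")
```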
Running the exact same code on just 1 machine works perfectly fine; the issue only appears with distributed training. As a preface, algo-1 is the hostname of the machine that fails.
The exact error log is:

```
algo-1: Traceback (most recent call last):
algo-1:   File "main.py", line 191, in <module>
algo-1:     main(args)
algo-1:   File "main.py", line 185, in main
algo-1:     trainer = DeepspeedTrainer(model, train_dataset, valid_dataset, args)
algo-1:   File "/cursor-ml/cad_ml/train/train_deepspeed.py", line 28, in __init__
algo-1:     super().__init__(model, train_dataset, valid_dataset, args)
algo-1:   File "/cursor-ml/cad_ml/train/train.py", line 31, in __init__
algo-1:     self.setup()
algo-1:   File "/cursor-ml/cad_ml/train/train_deepspeed.py", line 37, in setup
algo-1:     self.engine, _, __, ___ = deepspeed.initialize(
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
algo-1:     engine = DeepSpeedEngine(args=args,
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 233, in __init__
algo-1:     init_distributed(dist_backend=self.dist_backend)
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 49, in init_distributed
algo-1:     torch.distributed.init_process_group(backend=dist_backend,
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
algo-1:     store, rank, world_size = next(rendezvous_iterator)
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
algo-1:     store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
algo-1:     return TCPStore(
algo-1: RuntimeError: Address already in use
```

**To Reproduce**
Steps to reproduce the behavior:
- Start a SageMaker job on 4 ml.p3.8xlarge instances
- Run
  ```
  deepspeed --hostfile=hostfile.txt --master_port=29600 main.py
  ```
  given a standard training loop (a minimal sketch of such a loop follows this list)
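The report does not include main.py, and the DeepspeedTrainer internals from the traceback are not shown either. As a stand-in, here is a minimal sketch of a "standard training loop" under DeepSpeed; the model, data, and loss are placeholders, and a DeepSpeed config JSON passed via `--deepspeed_config` is assumed. The `deepspeed.initialize` call is the point at which the traceback above fails:

```python
import argparse

import deepspeed
import torch


def main(args):
    model = torch.nn.Linear(10, 10)  # placeholder model

    # deepspeed.initialize sets up torch.distributed internally; this is
    # where the "Address already in use" error in the traceback is raised.
    engine, _, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters(),
    )

    for _ in range(10):
        batch = torch.randn(8, 10, device=engine.device)
        loss = engine(batch).pow(2).mean()  # dummy loss
        engine.backward(loss)
        engine.step()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    parser = deepspeed.add_config_arguments(parser)
    main(parser.parse_args())
```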
**Expected behavior**
I expect the training script to execute properly.
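As an aside (not from the original report): the TCPStore failure above means the rendezvous port was already bound on the master node. A minimal way to check whether a port such as 29600 is free, assuming you can run Python on that node:

```python
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding (host, port) fails, i.e. something already owns it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
    return False


if __name__ == "__main__":
    print(port_in_use(29600))  # the --master_port used in the launch command
```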
**ds_report output**
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch']
torch version .................... 1.10.2+cu102
torch cuda version ............... 10.2
nvcc version ..................... 10.0
deepspeed install path ........... ['/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
```
**System info (please complete the following information):**
- OS: Ubuntu 18.04
- 4 machines with 4 V100s each
- I believe roughly 10 Gbps of network bandwidth between machines
- Python 3.8.12
**Launcher context**
I am using the DeepSpeed launcher.
**Docker context**
I believe it resembles the DeepSpeed Dockerfile, but with a few additional changes.
**Top GitHub Comments**
Hey @tjruwase, running the deepspeed launch command from a single machine, rather than on all of them at the same time, solved this issue.
Thanks!
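For context (an explanation consistent with this fix, though not spelled out in the thread): the deepspeed launcher reads the hostfile and spawns the worker processes on every listed host itself, over pdsh/ssh, so the command

```
deepspeed --hostfile=hostfile.txt --master_port=29600 main.py
```

only needs to be issued once, on a single node. Issuing it on all $m$ nodes starts $m$ independent launchers, each spawning $mn$ workers, which matches the $m^2 n$ processes observed and the resulting collision on the master port of algo-1.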
@Sanger2000, do you have any updates on this issue? Thanks!