[BUG] Multi-Node Address in Use Error
**Describe the bug**
I am running DeepSpeed on AWS SageMaker across 4 machines with 4 V100s each (p3.8xlarge instances). I use a custom Docker image in order to run DeepSpeed (SageMaker does not natively support it).
The command used is:

```
deepspeed --hostfile=hostfile.txt --master_port=29600 main.py
```
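The hostfile itself is not shown in the report. DeepSpeed's documented hostfile format lists one host per line with a slot count, so for this setup it would presumably look like the following (the `algo-*` hostnames are an assumption, inferred from the SageMaker naming that appears in the error log below):

```
algo-1 slots=4
algo-2 slots=4
algo-3 slots=4
algo-4 slots=4
```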
What I believe happens is that for $m$ machines with $n$ GPUs each, instead of starting $mn$ processes, DeepSpeed starts $m^2 n$. Each machine shows DeepSpeed running 16 processes, but these are all distinct processes, so 64 are running at the same time.
I verified that it was actually 64 processes, and not just 16, from the logs written on each machine, since the logs differed (only one of the machines did not error). I also printed time.time() at the same point in the code and got different values for what should have been the same process across the different machines' logs.
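For illustration (this is a sketch, not the exact code from the report), a duplicate-process check of this kind can rely on the `RANK`/`LOCAL_RANK`/`WORLD_SIZE` environment variables that the DeepSpeed launcher sets for each worker:

```python
import os
import socket
import time

# RANK / LOCAL_RANK / WORLD_SIZE are set by the DeepSpeed launcher for
# each worker process it spawns.
rank = os.environ.get("RANK", "?")
local_rank = os.environ.get("LOCAL_RANK", "?")
world_size = os.environ.get("WORLD_SIZE", "?")

# If the same (rank, local_rank) pair shows up in the logs of more than
# one machine, or with clearly different timestamps, then more than one
# launcher is spawning overlapping process groups.
print(f"host={socket.gethostname()} rank={rank} local_rank={local_rank} "
      f"world_size={world_size} t={time.time():.3f}")
```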
Running the exact same code on just 1 machine works perfectly fine; the issue only appears with distributed training. As a preface, algo-1 is the hostname of the machine that fails.
The exact error log is:

```
algo-1: Traceback (most recent call last):
algo-1:   File "main.py", line 191, in <module>
algo-1:     main(args)
algo-1:   File "main.py", line 185, in main
algo-1:     trainer = DeepspeedTrainer(model, train_dataset, valid_dataset, args)
algo-1:   File "/cursor-ml/cad_ml/train/train_deepspeed.py", line 28, in __init__
algo-1:     super().__init__(model, train_dataset, valid_dataset, args)
algo-1:   File "/cursor-ml/cad_ml/train/train.py", line 31, in __init__
algo-1:     self.setup()
algo-1:   File "/cursor-ml/cad_ml/train/train_deepspeed.py", line 37, in setup
algo-1:     self.engine, _, __, ___ = deepspeed.initialize(
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
algo-1:     engine = DeepSpeedEngine(args=args,
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 233, in __init__
algo-1:     init_distributed(dist_backend=self.dist_backend)
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed/utils/distributed.py", line 49, in init_distributed
algo-1:     torch.distributed.init_process_group(backend=dist_backend,
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
algo-1:     store, rank, world_size = next(rendezvous_iterator)
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
algo-1:     store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
algo-1:   File "/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
algo-1:     return TCPStore(
algo-1: RuntimeError: Address already in use
```

**To Reproduce**
Steps to reproduce the behavior:
- Start a SageMaker job on 4 ml.p3.8xlarge instances
- Run
  ```
  deepspeed --hostfile=hostfile.txt --master_port=29600 main.py
  ```
  given a standard training loop (a minimal sketch of such a loop follows this list)
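The report does not include main.py, and the DeepspeedTrainer internals from the traceback are not shown either. As a stand-in, here is a minimal sketch of a "standard training loop" under DeepSpeed; the model, data, and loss are placeholders, and a DeepSpeed config JSON passed via `--deepspeed_config` is assumed. The `deepspeed.initialize` call is the point at which the traceback above fails:

```python
import argparse

import deepspeed
import torch


def main(args):
    model = torch.nn.Linear(10, 10)  # placeholder model

    # deepspeed.initialize sets up torch.distributed internally; this is
    # where the "Address already in use" error in the traceback is raised.
    engine, _, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters(),
    )

    for _ in range(10):
        batch = torch.randn(8, 10, device=engine.device)
        loss = engine(batch).pow(2).mean()  # dummy loss
        engine.backward(loss)
        engine.step()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    parser = deepspeed.add_config_arguments(parser)
    main(parser.parse_args())
```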
**Expected behavior**
I expect the training script to execute properly.
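As an aside (not from the original report): the TCPStore failure above means the rendezvous port was already bound on the master node. A minimal way to check whether a port such as 29600 is free, assuming you can run Python on that node:

```python
import socket


def port_in_use(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding (host, port) fails, i.e. something already owns it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
    return False


if __name__ == "__main__":
    print(port_in_use(29600))  # the --master_port used in the launch command
```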
**ds_report output**
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/torch']
torch version .................... 1.10.2+cu102
torch cuda version ............... 10.2
nvcc version ..................... 10.0
deepspeed install path ........... ['/root/anaconda3/envs/cursor-ml/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.10, unknown, unknown
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
```
**System info (please complete the following information):**
- OS: Ubuntu 18.04
- 4 machines with 4 V100s each
- I believe roughly 10 Gbps of network bandwidth between machines
- Python 3.8.12
**Launcher context**
I am using the DeepSpeed launcher.
**Docker context**
I believe it resembles the DeepSpeed Dockerfile, but with a few additional changes.
**Top GitHub Comments**
Hey @tjruwase, running the deepspeed launch command from a single machine, rather than on all of them at the same time, solved this issue.
Thanks!
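For context (an explanation consistent with this fix, though not spelled out in the thread): the deepspeed launcher reads the hostfile and spawns the worker processes on every listed host itself, over pdsh/ssh, so the command

```
deepspeed --hostfile=hostfile.txt --master_port=29600 main.py
```

only needs to be issued once, on a single node. Issuing it on all $m$ nodes starts $m$ independent launchers, each spawning $mn$ workers, which matches the $m^2 n$ processes observed and the resulting collision on the master port of algo-1.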
@Sanger2000, do you have any updates on this issue? Thanks!