[RuntimeError: Connection reset by peer] When scaling up training jobs
See original GitHub issue.
I am facing a similar problem to the one posted by @g-karthik in https://github.com/microsoft/DeepSpeed/issues/570#issuecomment-750744107.
When I use 40 nodes with 10 GPUs on each node (400 processes), training works well. But when I scale up the training to 40 or more nodes, deepspeed.init_distributed() fails with:
Traceback (most recent call last):
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 947, in <module>
    main()
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 769, in main
    initialize_distributed(args)
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 703, in initialize_distributed
    deepspeed.init_distributed(distributed_port=29501)
  File "/home/hanwentao/.local/lib/python3.8/site-packages/deepspeed-0.3.11+4f1d827-py3.8.egg/deepspeed/utils/distributed.py", line 49, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend,
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: Connection reset by peer
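The call chain bottoms out in the barrier() that init_process_group runs right after creating the default process group, i.e. the first point where every rank has to reach the rank-0 store and each other. Below is a minimal, hypothetical repro sketch of just that rendezvous step (not the original pretrain_enc_dec.py); it assumes the launcher exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK, and adds NCCL debug logging plus a longer timeout to help see where the connection drops.

import datetime
import os

import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")   # surface NCCL bootstrap/socket errors in the logs

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)             # the NCCL barrier allocates a tensor on this device

dist.init_process_group(
    backend="nccl",
    init_method="env://",                     # reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE
    timeout=datetime.timedelta(minutes=30),   # more headroom while hundreds of ranks connect
)
dist.barrier()                                # the same collective that raised the error above
print(f"rank {dist.get_rank()}/{dist.get_world_size()} passed the barrier")
dist.destroy_process_group()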
I used the DeepSpeed version from the master branch. I ran my script with mpirun, just as described in https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility.
Any ideas on what’s going on?
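One thing worth ruling out when rendezvous only breaks past a few hundred ranks: rank 0 accepts roughly one store connection per rank, so a low open-file limit on that node, or NCCL bootstrapping over the wrong network interface, can surface as exactly this "Connection reset by peer". The sketch below (a suggestion, not a confirmed fix from this thread) shows the kind of checks to run before calling deepspeed.init_distributed(); the interface name "ib0" is a placeholder for your cluster, while port 29501 is taken from the traceback above.

import os
import resource

import deepspeed

os.environ["NCCL_DEBUG"] = "INFO"                    # log which NIC/socket NCCL actually uses
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")   # placeholder: pin the bootstrap interface

# Rank 0 handles roughly one store connection per rank; with several hundred
# processes, a low open-file limit there can show up as a connection reset.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")

deepspeed.init_distributed(dist_backend="nccl", distributed_port=29501)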
Issue Analytics
- Created: 3 years ago
- Comments: 15 (6 by maintainers)
Top GitHub Comments
Hey @jeffra! It looks like FB published a new Docker image for the latest PyTorch, with NCCL 2.9.6:
https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-04.html#rel_21-04
Seems like the LD_PRELOAD hack won’t be needed any more? I see your PyTorch PRs haven’t been merged but I am assuming they’re not needed.
Does DeepSpeed support this base image?
Hi @g-karthik, it should probably work? But I have not tried it myself with torch 1.4. Sorry for a less-than-confident answer there, haha 😃
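For anyone following up on the container question, a quick way to confirm which NCCL build the image's PyTorch actually links (and so whether an LD_PRELOAD override is still needed) is to query torch itself. This is just a verification snippet, not something from the thread; the return format of torch.cuda.nccl.version() varies by torch release (a packed int on older builds, a (major, minor, patch) tuple on newer ones).

import torch
import torch.cuda.nccl as nccl

print("torch:", torch.__version__)
print("CUDA :", torch.version.cuda)
print("NCCL :", nccl.version())   # e.g. (2, 9, 6) on recent builds, 2906 on older ones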