[RuntimeError: Connection reset by peer] When scaling up training jobs

I am facing a problem similar to the one posted by @g-karthik in https://github.com/microsoft/DeepSpeed/issues/570#issuecomment-750744107.

When I use 40 nodes with 10 GPUs on each node (400 jobs), the training works well. But when I scale up the training to 40 or more nodes, deepspeed.init_distributed() fails with:

Traceback (most recent call last):
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 947, in <module>
    main()
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 769, in main
    initialize_distributed(args)
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 703, in initialize_distributed
    deepspeed.init_distributed(distributed_port=29501)
  File "/home/hanwentao/.local/lib/python3.8/site-packages/deepspeed-0.3.11+4f1d827-py3.8.egg/deepspeed/utils/distributed.py", line 49, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend,
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: Connection reset by peer
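
Since the traceback ends in a socket-level error during the first barrier, one cheap sanity check is whether each worker can open a TCP connection to the rendezvous port at all. The sketch below uses only the Python standard library; the helper name can_reach and the address in the usage section are hypothetical, and port 29501 is taken from the distributed_port argument in the traceback. It only tests basic reachability and says nothing about why connections get reset at larger scale.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused, reset, unreachable and timed-out connections.
        return False


if __name__ == "__main__":
    # Hypothetical address: replace with the host that runs rank 0.
    master_addr = "127.0.0.1"
    master_port = 29501  # matches distributed_port in the traceback
    reachable = can_reach(master_addr, master_port)
    print(f"{master_addr}:{master_port} reachable: {reachable}")
```

Running this on a few worker nodes while rank 0 is waiting in init_process_group can at least rule out firewall or routing problems before looking at NCCL or rendezvous limits.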

I used DeepSpeed from the master branch. I ran my script with mpirun, just as described in https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility.

Any ideas on what’s going on?
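
For readers unfamiliar with launching through mpirun: torch.distributed's env:// initialization reads RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, while Open MPI exposes rank information through its own OMPI_COMM_WORLD_* variables. The following is a minimal sketch of that mapping, assuming Open MPI is the MPI implementation. The function name torch_env_from_ompi and the host name "node0" are hypothetical, and this is not DeepSpeed's own discovery code.

```python
import os
from typing import Dict, Mapping


def torch_env_from_ompi(
    env: Mapping[str, str], master_addr: str, master_port: int = 29501
) -> Dict[str, str]:
    """Translate Open MPI rank variables into torch.distributed env:// names.

    Raises KeyError if the process was not started by Open MPI's mpirun.
    """
    return {
        "RANK": env["OMPI_COMM_WORLD_RANK"],
        "WORLD_SIZE": env["OMPI_COMM_WORLD_SIZE"],
        "LOCAL_RANK": env["OMPI_COMM_WORLD_LOCAL_RANK"],
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }


if __name__ == "__main__":
    # Only act when actually launched under Open MPI.
    if "OMPI_COMM_WORLD_RANK" in os.environ:
        # "node0" is a hypothetical host name for the rank-0 node.
        os.environ.update(torch_env_from_ompi(os.environ, "node0"))
        print({k: os.environ[k] for k in ("RANK", "WORLD_SIZE", "LOCAL_RANK")})
```

Printing these values on every rank before initialization is a quick way to confirm that all 400 processes agree on the world size and point at the same master address and port.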

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 15 (6 by maintainers)

Top GitHub Comments

g-karthik commented, May 6, 2021

Hey @jeffra! It looks like NVIDIA published a new Docker image for the latest PyTorch, with NCCL 2.9.6:

https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-04.html#rel_21-04

Seems like the LD_PRELOAD hack won’t be needed any more? I see your PyTorch PRs haven’t been merged, but I am assuming they’re not needed.

Does DeepSpeed support this base image?

jeffra commented, Feb 18, 2021

Hi @g-karthik, it should probably work? But I have not tried it myself with torch 1.4. Sorry for a less than confident answer there haha 😃
