[RuntimeError: Connection reset by peer] When scaling up training jobs
See original GitHub issue.
I am facing a similar problem to the one posted by @g-karthik in https://github.com/microsoft/DeepSpeed/issues/570#issuecomment-750744107.
When I use 40 nodes with 10 GPUs on each node (400 processes), training works well. But when I scale up the training to 40 or more nodes, deepspeed.init_distributed() fails with:
Traceback (most recent call last):
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 947, in <module>
    main()
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 769, in main
    initialize_distributed(args)
  File "/home/hanwentao/work/enc-dec-pretrain/Megatron-LM/pretrain_enc_dec.py", line 703, in initialize_distributed
    deepspeed.init_distributed(distributed_port=29501)
  File "/home/hanwentao/.local/lib/python3.8/site-packages/deepspeed-0.3.11+4f1d827-py3.8.egg/deepspeed/utils/distributed.py", line 49, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend,
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/hanwentao/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: Connection reset by peer
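The call chain bottoms out in the barrier() that init_process_group runs right after creating the default process group, i.e. the first point where every rank has to reach the rank-0 store and each other. Below is a minimal, hypothetical repro sketch of just that rendezvous step (not the original pretrain_enc_dec.py); it assumes the launcher exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK, and adds NCCL debug logging plus a longer timeout to help see where the connection drops.

import datetime
import os

import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")   # surface NCCL bootstrap/socket errors in the logs

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)             # the NCCL barrier allocates a tensor on this device

dist.init_process_group(
    backend="nccl",
    init_method="env://",                     # reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE
    timeout=datetime.timedelta(minutes=30),   # more headroom while hundreds of ranks connect
)
dist.barrier()                                # the same collective that raised the error above
print(f"rank {dist.get_rank()}/{dist.get_world_size()} passed the barrier")
dist.destroy_process_group()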
I used the DeepSpeed version from the master branch. I ran my script with mpirun, just as described in https://www.deepspeed.ai/getting-started/#mpi-and-azureml-compatibility.
Any ideas on what’s going on?
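One thing worth ruling out when rendezvous only breaks past a few hundred ranks: rank 0 accepts roughly one store connection per rank, so a low open-file limit on that node, or NCCL bootstrapping over the wrong network interface, can surface as exactly this "Connection reset by peer". The sketch below (a suggestion, not a confirmed fix from this thread) shows the kind of checks to run before calling deepspeed.init_distributed(); the interface name "ib0" is a placeholder for your cluster, while port 29501 is taken from the traceback above.

import os
import resource

import deepspeed

os.environ["NCCL_DEBUG"] = "INFO"                    # log which NIC/socket NCCL actually uses
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")   # placeholder: pin the bootstrap interface

# Rank 0 handles roughly one store connection per rank; with several hundred
# processes, a low open-file limit there can show up as a connection reset.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft} hard={hard}")

deepspeed.init_distributed(dist_backend="nccl", distributed_port=29501)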
Issue Analytics
- Created: 3 years ago
- Comments: 15 (6 by maintainers)
Top GitHub Comments
Hey @jeffra! It looks like FB published a new Docker image for the latest PyTorch, with NCCL 2.9.6:
https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-04.html#rel_21-04
Seems like the LD_PRELOAD hack won’t be needed any more? I see your PyTorch PRs haven’t been merged but I am assuming they’re not needed.
Does DeepSpeed support this base image?
Hi @g-karthik, it should probably work? But I have not tried it myself with torch 1.4. Sorry for a less-than-confident answer there, haha 😃
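For anyone following up on the container question, a quick way to confirm which NCCL build the image's PyTorch actually links (and so whether an LD_PRELOAD override is still needed) is to query torch itself. This is just a verification snippet, not something from the thread; the return format of torch.cuda.nccl.version() varies by torch release (a packed int on older builds, a (major, minor, patch) tuple on newer ones).

import torch
import torch.cuda.nccl as nccl

print("torch:", torch.__version__)
print("CUDA :", torch.version.cuda)
print("NCCL :", nccl.version())   # e.g. (2, 9, 6) on recent builds, 2906 on older ones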