
Tremendous slowdown in multi-node distributed training

See original GitHub issue: https://github.com/huggingface/transformers/issues/3274

🐛 Bug

Information

Model I am using (Bert, XLNet …): Bert. I am fine-tuning a bert-base model on language modeling for a particular domain.

Language I am using the model on (English, Chinese …):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Training is happening on Azure NC24s_v3 nodes (4 V100s each) with NCCL as the backend. I’m comparing the performance of a single-node scenario vs. a two-node scenario. Note that there is no InfiniBand networking between the nodes, only 40 Gbps Ethernet.
  2. Use torch.distributed.launch to launch run_language_modeling.py in single-node (multi-GPU) and multi-node (multi-GPU) scenarios; a minimal sketch of the per-process setup the launcher drives is shown after this list.
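
For concreteness, each node is launched with something like `python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=<0 or 1> --master_addr=<node 0 IP> --master_port=<port> run_language_modeling.py <usual training args>` (the single-node run just uses `--nproc_per_node=4`). Under the hood that drives roughly the following per-process setup; this is a minimal illustrative sketch, not the actual script code:

```python
# Minimal sketch of what each launched process does (illustrative, not the real script).
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
# MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set by the launcher.
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(768, 768).cuda()          # stand-in for the BERT model
model = DDP(model, device_ids=[args.local_rank])  # gradients are all-reduced across all GPUs each step
```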

Expected behavior

In the single-node scenario, I’m getting about 2 iterations/sec during training. In the multi-node scenario, it drops to about 4 sec/iteration, roughly an 8x slowdown.

Theorizing that the network was the issue, I reduced the model size significantly (down to 1 layer from 12, with the other hyperparameters scaled down correspondingly) and ran the test again. The same slowdown was observed even then.
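
For reference, the rough back-of-envelope behind the “network” theory, using assumed round numbers rather than anything measured:

```python
# Back-of-envelope estimate with assumed round numbers (not measured values).
params = 110e6                 # approx. parameter count of bert-base
grad_bytes = params * 4        # fp32 gradients that DDP all-reduces every step (~440 MB)
link_bytes_per_s = 40e9 / 8    # 40 Gbps Ethernet at theoretical line rate (~5 GB/s)

# With a ring all-reduce, roughly 2x the gradient size has to cross the
# inter-node link per step.
min_comm_s = 2 * grad_bytes / link_bytes_per_s
print(f"gradient size ~{grad_bytes / 1e6:.0f} MB, "
      f"per-step comm lower bound ~{min_comm_s:.2f} s at line rate")
```

Even at line rate that is a noticeable per-step cost, and real TCP throughput without RDMA is typically well below line rate.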

Am I missing something here? Is it possible to perform multi-node training of bert models without infiniband?
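
One way to isolate the network from the model is a raw all-reduce micro-benchmark run with the same launcher; the sketch below is hypothetical, and the tensor size and NCCL settings are assumptions on my part:

```python
# Hypothetical all-reduce micro-benchmark, launched with the same
# torch.distributed.launch command as the training run.
import argparse
import os
import time

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

os.environ.setdefault("NCCL_DEBUG", "INFO")  # NCCL logs which transport it picks (IB vs plain sockets)
# os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pin the Ethernet interface if NCCL picks the wrong one

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

x = torch.randn(110_000_000, device="cuda")  # ~440 MB fp32, roughly bert-base gradient size

# Warm up once, then time a few iterations.
dist.all_reduce(x)
torch.cuda.synchronize()
start = time.time()
for _ in range(10):
    dist.all_reduce(x)
torch.cuda.synchronize()
if dist.get_rank() == 0:
    print(f"avg all_reduce time for ~440 MB: {(time.time() - start) / 10:.3f} s")
```

If this already shows multi-second all-reduce times across the two nodes, the Ethernet path (or NCCL falling back to plain sockets) is the bottleneck rather than anything specific to run_language_modeling.py.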

Environment info

  • transformers version: 2.5.1
  • Platform: Ubuntu 18.04
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.5.0
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
Genius1237 commented, Nov 19, 2020

Yes. Ran faster on a system with infiniband.

On Thu, Nov 19, 2020 at 6:08 PM gvijqb wrote:

Is there any update on this? I suppose that not having infiniband interconnect was the only limiting factor?


0 reactions
gvijqb commented, Nov 19, 2020

Is there any update on this? I suppose that not having infiniband interconnect was the only limiting factor?

