
Tremendous slowdown in multi-node distributed training

See original GitHub issue: https://github.com/huggingface/transformers/issues/3274

🐛 Bug

Information

Model I am using (Bert, XLNet …): Bert. I am fine-tuning a bert-base model on language modeling for a particular domain.

Language I am using the model on (English, Chinese …):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Training is happening on Azure NC24s_v3 nodes (4 V100s each) with NCCL as the backend. I’m comparing the performance of a single-node scenario vs. a two-node scenario. Note that there is no InfiniBand networking between the nodes, only 40 Gbps Ethernet.
  2. Use torch.distributed.launch to launch run_language_modeling.py in single-node (multi-GPU) and multi-node (multi-GPU) scenarios; a minimal sketch of the per-process setup the launcher drives is shown after this list.
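
For concreteness, each node is launched with something like `python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=<0 or 1> --master_addr=<node 0 IP> --master_port=<port> run_language_modeling.py <usual training args>` (the single-node run just uses `--nproc_per_node=4`). Under the hood that drives roughly the following per-process setup; this is a minimal illustrative sketch, not the actual script code:

```python
# Minimal sketch of what each launched process does (illustrative, not the real script).
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
# MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set by the launcher.
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(768, 768).cuda()          # stand-in for the BERT model
model = DDP(model, device_ids=[args.local_rank])  # gradients are all-reduced across all GPUs each step
```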

Expected behavior

In the single-node scenario, I’m getting about 2 iterations/sec during training. In the multi-node scenario, it drops to about 4 sec/iteration, roughly an 8x slowdown.

Theorizing that the network was the issue, I reduced the model size significantly (down to 1 layer from 12, with the other hyperparameters scaled down correspondingly) and ran the test again. The same slowdown was observed even then.
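
For reference, the rough back-of-envelope behind the “network” theory, using assumed round numbers rather than anything measured:

```python
# Back-of-envelope estimate with assumed round numbers (not measured values).
params = 110e6                 # approx. parameter count of bert-base
grad_bytes = params * 4        # fp32 gradients that DDP all-reduces every step (~440 MB)
link_bytes_per_s = 40e9 / 8    # 40 Gbps Ethernet at theoretical line rate (~5 GB/s)

# With a ring all-reduce, roughly 2x the gradient size has to cross the
# inter-node link per step.
min_comm_s = 2 * grad_bytes / link_bytes_per_s
print(f"gradient size ~{grad_bytes / 1e6:.0f} MB, "
      f"per-step comm lower bound ~{min_comm_s:.2f} s at line rate")
```

Even at line rate that is a noticeable per-step cost, and real TCP throughput without RDMA is typically well below line rate.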

Am I missing something here? Is it possible to perform multi-node training of bert models without infiniband?
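
One way to isolate the network from the model is a raw all-reduce micro-benchmark run with the same launcher; the sketch below is hypothetical, and the tensor size and NCCL settings are assumptions on my part:

```python
# Hypothetical all-reduce micro-benchmark, launched with the same
# torch.distributed.launch command as the training run.
import argparse
import os
import time

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

os.environ.setdefault("NCCL_DEBUG", "INFO")  # NCCL logs which transport it picks (IB vs plain sockets)
# os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pin the Ethernet interface if NCCL picks the wrong one

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

x = torch.randn(110_000_000, device="cuda")  # ~440 MB fp32, roughly bert-base gradient size

# Warm up once, then time a few iterations.
dist.all_reduce(x)
torch.cuda.synchronize()
start = time.time()
for _ in range(10):
    dist.all_reduce(x)
torch.cuda.synchronize()
if dist.get_rank() == 0:
    print(f"avg all_reduce time for ~440 MB: {(time.time() - start) / 10:.3f} s")
```

If this already shows multi-second all-reduce times across the two nodes, the Ethernet path (or NCCL falling back to plain sockets) is the bottleneck rather than anything specific to run_language_modeling.py.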

Environment info

  • transformers version: 2.5.1
  • Platform: Ubuntu 18.04
  • Python version: 3.6.10
  • PyTorch version (GPU?): 1.5.0
  • Tensorflow version (GPU?):
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
Genius1237 commented, Nov 19, 2020

Yes. Ran faster on a system with infiniband.

On Thu, Nov 19, 2020 at 6:08 PM gvijqb wrote:

Is there any update on this? I suppose that not having infiniband interconnect was the only limiting factor?


0 reactions
gvijqb commented, Nov 19, 2020

Is there any update on this? I suppose that not having infiniband interconnect was the only limiting factor?

