Tremendous slowdown in multi-node distributed training
🐛 Bug
Information
Model I am using (Bert, XLNet …): Bert. Fine-tuning a bert-base model on language modeling for a particular domain.
Language I am using the model on (English, Chinese …):
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Training is happening on Azure NC24s_v3 nodes (4 V100s each) with NCCL as the backend. I'm comparing the performance of a single-node scenario vs. a 2-node scenario. Note that there is no InfiniBand networking between the nodes, only 40 Gbps Ethernet.
- Use torch.distributed.launch to launch run_language_modeling.py in single-node (multi-GPU) and multi-node (multi-GPU) scenarios.
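One way to check which network interface and transport NCCL actually picks in the multi-node run is to turn on its debug logging before the process group is initialized. The following is a minimal sketch, not part of the original report. `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are NCCL environment variables; the helper name `nccl_debug_env` and the interface name `eth0` are assumptions for illustration and should be adapted to the actual nodes.

```python
import os


def nccl_debug_env(interface="eth0"):
    """Return environment variables that make NCCL log its setup.

    `interface` is a hypothetical NIC name; replace it with the
    interface that carries the inter-node traffic on your machines.
    """
    return {
        # Print NCCL initialization details, including the chosen transport.
        "NCCL_DEBUG": "INFO",
        # Restrict NCCL socket traffic to the named network interface.
        "NCCL_SOCKET_IFNAME": interface,
    }


# Set these before the training script initializes torch.distributed,
# so that NCCL sees them when the process group is created.
os.environ.update(nccl_debug_env())
```

The variables must be set in every process on every node, for example at the top of the training script or in the shell that invokes the launcher.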
Expected behavior
In the single-node scenario, I'm getting about 2 iterations/sec during training. In the multi-node scenario, it drops to 4 sec/iteration (0.25 iterations/sec), roughly an 8x slowdown.
Theorizing that the network was the issue, I reduced the model size significantly (down to 1 layer from 12, with the other hyperparameters scaled down accordingly) and ran the test again. The same slowdown was observed even then.
Am I missing something here? Is it possible to perform multi-node training of BERT models without InfiniBand?
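As a rough sanity check on the network theory, the gradient traffic per iteration can be estimated with simple arithmetic. The sketch below is a back-of-envelope lower bound, not a measurement. It assumes about 110 million parameters for bert-base, fp32 gradients (4 bytes each), a single ring all-reduce across all 8 GPUs, and the full 40 Gbps line rate on the inter-node link. The function name is made up for illustration.

```python
def ring_allreduce_seconds(num_params, bytes_per_param, world_size, link_gbps):
    """Lower bound on the time one ring all-reduce spends on a single link."""
    payload_bytes = num_params * bytes_per_param
    # In a ring all-reduce, each rank sends 2 * (N - 1) / N times the payload.
    traffic_bytes = 2 * (world_size - 1) / world_size * payload_bytes
    # Convert the link speed from gigabits per second to bytes per second.
    link_bytes_per_sec = link_gbps * 1e9 / 8
    return traffic_bytes / link_bytes_per_sec


# Assumed figures: ~110M parameters, fp32 gradients, 8 ranks, 40 Gbps link.
BERT_BASE_PARAMS = 110_000_000
estimate = ring_allreduce_seconds(BERT_BASE_PARAMS, 4, 8, 40)
print(f"estimated all-reduce time per iteration: {estimate:.3f} s")
```

Under these assumptions the estimate is about 0.15 s per iteration, well below the observed 4 s. Real throughput over plain TCP sockets is typically lower than line rate, so this is only a floor, but it suggests checking the effective bandwidth and latency between the nodes rather than model size alone.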
Environment info
- transformers version: 2.5.1
- Platform: Ubuntu 18.04
- Python version: 3.6.10
- PyTorch version (GPU?): 1.5.0
- Tensorflow version (GPU?):
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Issue Analytics
- State:
- Created 4 years ago
- Comments: 8 (4 by maintainers)
Yes. It ran faster on a system with InfiniBand.
On Thu, Nov 19, 2020 at 6:08 PM gvijqb notifications@github.com wrote:
Regards,
Anirudh Srinivasan
Research Fellow, Microsoft Research, India
Is there any update on this? I suppose the lack of an InfiniBand interconnect was the only limiting factor?