
Big difference between training with 1 GPU vs 8 GPUs (via DDP)

See original GitHub issue

Hi all,

I am using the recipe LibriSpeech/ASR/transformer with the hyperparameters saved in hparams/conformer_small.yaml.

I first tried training on a single GPU, with the following result:

epoch: 1, lr: 1.76e-04, steps: 4394, optimizer: Adam - train loss: 2.61e+02 - valid loss: 1.41e+02, valid ACC: 1.86e-01
epoch: 2, lr: 3.51e-04, steps: 8788, optimizer: Adam - train loss: 1.48e+02 - valid loss: 1.11e+02, valid ACC: 3.76e-01
epoch: 3, lr: 5.27e-04, steps: 13182, optimizer: Adam - train loss: 67.33 - valid loss: 58.62, valid ACC: 6.64e-01
epoch: 4, lr: 7.03e-04, steps: 17576, optimizer: Adam - train loss: 49.27 - valid loss: 75.51, valid ACC: 6.06e-01
epoch: 5, lr: 8.79e-04, steps: 21970, optimizer: Adam - train loss: 41.83 - valid loss: 21.69, valid ACC: 8.57e-01
epoch: 6, lr: 9.74e-04, steps: 26364, optimizer: Adam - train loss: 37.18 - valid loss: 14.90, valid ACC: 8.99e-01
epoch: 7, lr: 9.02e-04, steps: 30758, optimizer: Adam - train loss: 32.34 - valid loss: 12.56, valid ACC: 9.14e-01
epoch: 8, lr: 8.43e-04, steps: 35152, optimizer: Adam - train loss: 28.62 - valid loss: 15.06, valid ACC: 8.97e-01
epoch: 9, lr: 7.95e-04, steps: 39546, optimizer: Adam - train loss: 25.86 - valid loss: 10.85, valid ACC: 9.24e-01
epoch: 10, lr: 7.54e-04, steps: 43940, optimizer: Adam - train loss: 23.81 - valid loss: 10.45, valid ACC: 9.28e-01, valid WER: 7.79

which makes sense to me. However, when I move to 8 GPUs (changing gradient_accumulation from 4 to 1 and reducing batch_size from 16 to 8 so that the global batch size matches the 1-GPU case) by running python -m torch.distributed.launch --nproc_per_node=8 train.py hparams/conformer_small.yaml --distributed_launch --distributed_backend=nccl, I get

epoch: 1, lr: 1.76e-04, steps: 4395, optimizer: Adam - train loss: 2.76e+02 - valid loss: 1.55e+02, valid ACC: 1.50e-01
epoch: 2, lr: 3.52e-04, steps: 8790, optimizer: Adam - train loss: 2.41e+02 - valid loss: 1.49e+02, valid ACC: 1.72e-01
epoch: 3, lr: 5.27e-04, steps: 13185, optimizer: Adam - train loss: 2.35e+02 - valid loss: 1.48e+02, valid ACC: 1.83e-01
epoch: 4, lr: 7.03e-04, steps: 17580, optimizer: Adam - train loss: 2.30e+02 - valid loss: 1.49e+02, valid ACC: 1.87e-01
epoch: 5, lr: 8.79e-04, steps: 21975, optimizer: Adam - train loss: 2.27e+02 - valid loss: 1.49e+02, valid ACC: 1.94e-01
epoch: 6, lr: 9.74e-04, steps: 26370, optimizer: Adam - train loss: 2.25e+02 - valid loss: 1.49e+02, valid ACC: 1.96e-01
epoch: 7, lr: 9.01e-04, steps: 30765, optimizer: Adam - train loss: 2.22e+02 - valid loss: 1.49e+02, valid ACC: 2.01e-01
epoch: 8, lr: 8.43e-04, steps: 35160, optimizer: Adam - train loss: 2.18e+02 - valid loss: 1.50e+02, valid ACC: 2.03e-01
epoch: 9, lr: 7.95e-04, steps: 39555, optimizer: Adam - train loss: 2.15e+02 - valid loss: 1.52e+02, valid ACC: 2.03e-01
epoch: 10, lr: 7.54e-04, steps: 43950, optimizer: Adam - train loss: 2.13e+02 - valid loss: 1.54e+02, valid ACC: 2.02e-01, valid WER: 2.58e+02
epoch: 11, lr: 7.19e-04, steps: 48345, optimizer: Adam - train loss: 2.11e+02 - valid loss: 1.54e+02, valid ACC: 2.00e-01
epoch: 12, lr: 6.89e-04, steps: 52740, optimizer: Adam - train loss: 2.09e+02 - valid loss: 1.55e+02, valid ACC: 1.98e-01
epoch: 13, lr: 6.61e-04, steps: 57135, optimizer: Adam - train loss: 2.07e+02 - valid loss: 1.56e+02, valid ACC: 1.97e-01
epoch: 14, lr: 6.37e-04, steps: 61530, optimizer: Adam - train loss: 2.06e+02 - valid loss: 1.56e+02, valid ACC: 1.95e-01
epoch: 15, lr: 6.16e-04, steps: 65925, optimizer: Adam - train loss: 2.04e+02 - valid loss: 1.57e+02, valid ACC: 1.93e-01
epoch: 16, lr: 5.96e-04, steps: 70320, optimizer: Adam - train loss: 2.03e+02 - valid loss: 1.58e+02, valid ACC: 1.91e-01
epoch: 17, lr: 5.78e-04, steps: 74715, optimizer: Adam - train loss: 2.02e+02 - valid loss: 1.58e+02, valid ACC: 1.89e-01
epoch: 18, lr: 5.62e-04, steps: 79110, optimizer: Adam - train loss: 2.01e+02 - valid loss: 1.59e+02, valid ACC: 1.87e-01
epoch: 19, lr: 5.47e-04, steps: 83505, optimizer: Adam - train loss: 2.00e+02 - valid loss: 1.57e+02, valid ACC: 1.87e-01
epoch: 20, lr: 5.33e-04, steps: 87900, optimizer: Adam - train loss: 1.99e+02 - valid loss: 1.59e+02, valid ACC: 1.85e-01, valid WER: 1.97e+02
epoch: 21, lr: 5.20e-04, steps: 92295, optimizer: Adam - train loss: 1.99e+02 - valid loss: 1.58e+02, valid ACC: 1.84e-01
epoch: 22, lr: 5.08e-04, steps: 96690, optimizer: Adam - train loss: 1.98e+02 - valid loss: 1.59e+02, valid ACC: 1.83e-01

which seems to indicate that something is wrong, because the train loss gets stuck. I also tried increasing the global batch size from 64 to 128 by raising the per-GPU batch_size to 16, but I get a similar outcome. Is this a problem related to DDP, and therefore an expected outcome?
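
For reference, the arithmetic behind "the same global batch size": under DDP each process consumes its own per-GPU batch, so the number of examples per optimizer step is batch_size * gradient_accumulation * world_size. A minimal sketch in plain Python (not SpeechBrain code; the function name is just for illustration) of the two configurations above:

# Minimal sketch: examples consumed per optimizer step under DDP.
def global_batch_size(batch_size, gradient_accumulation, world_size):
    return batch_size * gradient_accumulation * world_size

# 1-GPU run with the recipe defaults: 16 * 4 * 1 = 64
print(global_batch_size(batch_size=16, gradient_accumulation=4, world_size=1))

# 8-GPU run described above: 8 * 1 * 8 = 64, i.e. the same global batch size
print(global_batch_size(batch_size=8, gradient_accumulation=1, world_size=8))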

Thanks for your help.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 21

Top GitHub Comments

1 reaction
TParcollet commented, Jul 30, 2021

We spotted a few errors in the recipe that could cause issues with multi-GPU. Let us work a bit on that part 😃

0 reactions
PabloPeso commented, Sep 7, 2021

Writing this comment just to help close this issue.

I have now tested it with the right number of backward steps per epoch, and the result is pretty close to the 1-GPU case. To get the same number of steps when using 8 GPUs, I set batch_size: 8 and gradient_accumulation: 1. Here are the results:

epoch: 1, lr: 1.76e-04, steps: 4395, optimizer: Adam - train loss: 2.61e+02 - valid loss: 1.43e+02, valid ACC: 1.85e-01
epoch: 2, lr: 3.52e-04, steps: 8790, optimizer: Adam - train loss: 1.55e+02 - valid loss: 94.45, valid ACC: 4.31e-01
epoch: 3, lr: 5.27e-04, steps: 13185, optimizer: Adam - train loss: 71.98 - valid loss: 44.09, valid ACC: 7.05e-01
epoch: 4, lr: 7.03e-04, steps: 17580, optimizer: Adam - train loss: 52.36 - valid loss: 38.74, valid ACC: 7.63e-01
epoch: 5, lr: 8.79e-04, steps: 21975, optimizer: Adam - train loss: 44.02 - valid loss: 22.74, valid ACC: 8.50e-01
epoch: 6, lr: 9.74e-04, steps: 26370, optimizer: Adam - train loss: 38.70 - valid loss: 16.57, valid ACC: 8.87e-01
epoch: 7, lr: 9.01e-04, steps: 30765, optimizer: Adam - train loss: 33.46 - valid loss: 14.13, valid ACC: 9.06e-01
epoch: 8, lr: 8.43e-04, steps: 35160, optimizer: Adam - train loss: 29.12 - valid loss: 15.11, valid ACC: 8.98e-01
epoch: 9, lr: 7.95e-04, steps: 39555, optimizer: Adam - train loss: 26.34 - valid loss: 13.50, valid ACC: 9.10e-01
epoch: 10, lr: 7.54e-04, steps: 43950, optimizer: Adam - train loss: 24.53 - valid loss: 12.55, valid ACC: 9.15e-01, valid WER: 8.61

This is very close to the initial 1-GPU results I included in the first comment (copied below):

epoch: 1, lr: 1.76e-04, steps: 4394, optimizer: Adam - train loss: 2.61e+02 - valid loss: 1.41e+02, valid ACC: 1.86e-01
epoch: 2, lr: 3.51e-04, steps: 8788, optimizer: Adam - train loss: 1.48e+02 - valid loss: 1.11e+02, valid ACC: 3.76e-01
epoch: 3, lr: 5.27e-04, steps: 13182, optimizer: Adam - train loss: 67.33 - valid loss: 58.62, valid ACC: 6.64e-01
epoch: 4, lr: 7.03e-04, steps: 17576, optimizer: Adam - train loss: 49.27 - valid loss: 75.51, valid ACC: 6.06e-01
epoch: 5, lr: 8.79e-04, steps: 21970, optimizer: Adam - train loss: 41.83 - valid loss: 21.69, valid ACC: 8.57e-01
epoch: 6, lr: 9.74e-04, steps: 26364, optimizer: Adam - train loss: 37.18 - valid loss: 14.90, valid ACC: 8.99e-01
epoch: 7, lr: 9.02e-04, steps: 30758, optimizer: Adam - train loss: 32.34 - valid loss: 12.56, valid ACC: 9.14e-01
epoch: 8, lr: 8.43e-04, steps: 35152, optimizer: Adam - train loss: 28.62 - valid loss: 15.06, valid ACC: 8.97e-01
epoch: 9, lr: 7.95e-04, steps: 39546, optimizer: Adam - train loss: 25.86 - valid loss: 10.85, valid ACC: 9.24e-01
epoch: 10, lr: 7.54e-04, steps: 43940, optimizer: Adam - train loss: 23.81 - valid loss: 10.45, valid ACC: 9.28e-01, valid WER: 7.79

Thanks for your help.
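
A note on why "the right backward steps per epoch" matters: the learning-rate values in the logs warm up and then decay as a function of the optimizer step count, which looks like a Noam-style schedule (an assumption on my part; the exact scheduler and hyperparameter names live in conformer_small.yaml). If the number of optimizer steps per epoch changes between the 1-GPU and 8-GPU runs, the warmup unfolds on a different timetable and training can diverge even with the same global batch size. Below is a rough sketch in plain Python with illustrative constants (a 25,000-step warmup and a 1e-3 peak happen to reproduce the logged lr values closely) and approximate step counting (handling of the last partial batch can shift the count by a step, which would explain the 4394 vs. 4395 steps per epoch in the logs):

# Illustrative Noam-style schedule (assumed, not the recipe's exact code):
# the learning rate is a function of the optimizer step count only,
# peaking at lr_peak after n_warmup_steps.
def noam_lr(step, lr_peak, n_warmup_steps):
    step = max(step, 1)  # avoid division by zero at step 0
    return lr_peak * (n_warmup_steps ** 0.5) * min(step ** -0.5,
                                                   step * n_warmup_steps ** -1.5)

# Approximate optimizer steps per epoch: each DDP process sees roughly
# n_utterances / world_size examples, and one optimizer step happens every
# gradient_accumulation batches.
def steps_per_epoch(n_utterances, batch_size, gradient_accumulation, world_size):
    batches_per_process = n_utterances / (world_size * batch_size)
    return round(batches_per_process / gradient_accumulation)

# LibriSpeech 960h has 281,241 training utterances.
print(steps_per_epoch(281_241, batch_size=16, gradient_accumulation=4, world_size=1))  # ~4394
print(steps_per_epoch(281_241, batch_size=8,  gradient_accumulation=1, world_size=8))  # ~4394

# With the assumed constants, the lr at a few logged step counts:
for step in (4_394, 26_364, 43_940):
    print(f"step {step}: lr {noam_lr(step, lr_peak=1e-3, n_warmup_steps=25_000):.2e}")
# -> roughly 1.76e-04, 9.74e-04, 7.54e-04, matching the 1-GPU log above

Since both configurations take about the same number of optimizer steps per epoch, the lr schedule (and, as the results show, training) lines up with the single-GPU run.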
