Big difference between training with 1 GPU vs 8 GPUs (via DDP)
Hi all,
I am using the LibriSpeech/ASR/transformer recipe with the hyperparameters from hparams/conformer_small.yaml.
I first tried training on a single GPU, with the following results:
epoch: 1, lr: 1.76e-04, steps: 4394, optimizer: Adam - train loss: 2.61e+02 - valid loss: 1.41e+02, valid ACC: 1.86e-01
epoch: 2, lr: 3.51e-04, steps: 8788, optimizer: Adam - train loss: 1.48e+02 - valid loss: 1.11e+02, valid ACC: 3.76e-01
epoch: 3, lr: 5.27e-04, steps: 13182, optimizer: Adam - train loss: 67.33 - valid loss: 58.62, valid ACC: 6.64e-01
epoch: 4, lr: 7.03e-04, steps: 17576, optimizer: Adam - train loss: 49.27 - valid loss: 75.51, valid ACC: 6.06e-01
epoch: 5, lr: 8.79e-04, steps: 21970, optimizer: Adam - train loss: 41.83 - valid loss: 21.69, valid ACC: 8.57e-01
epoch: 6, lr: 9.74e-04, steps: 26364, optimizer: Adam - train loss: 37.18 - valid loss: 14.90, valid ACC: 8.99e-01
epoch: 7, lr: 9.02e-04, steps: 30758, optimizer: Adam - train loss: 32.34 - valid loss: 12.56, valid ACC: 9.14e-01
epoch: 8, lr: 8.43e-04, steps: 35152, optimizer: Adam - train loss: 28.62 - valid loss: 15.06, valid ACC: 8.97e-01
epoch: 9, lr: 7.95e-04, steps: 39546, optimizer: Adam - train loss: 25.86 - valid loss: 10.85, valid ACC: 9.24e-01
epoch: 10, lr: 7.54e-04, steps: 43940, optimizer: Adam - train loss: 23.81 - valid loss: 10.45, valid ACC: 9.28e-01, valid WER: 7.79
which makes sense to me. However, when I move to 8 GPUs (changing gradient_accumulation from 4 to 1 and reducing batch_size from 16 to 8, so that the global batch size matches the single-GPU case) and run
python -m torch.distributed.launch --nproc_per_node=8 train.py hparams/conformer_small.yaml --distributed_launch --distributed_backend=nccl
I get:
epoch: 1, lr: 1.76e-04, steps: 4395, optimizer: Adam - train loss: 2.76e+02 - valid loss: 1.55e+02, valid ACC: 1.50e-01
epoch: 2, lr: 3.52e-04, steps: 8790, optimizer: Adam - train loss: 2.41e+02 - valid loss: 1.49e+02, valid ACC: 1.72e-01
epoch: 3, lr: 5.27e-04, steps: 13185, optimizer: Adam - train loss: 2.35e+02 - valid loss: 1.48e+02, valid ACC: 1.83e-01
epoch: 4, lr: 7.03e-04, steps: 17580, optimizer: Adam - train loss: 2.30e+02 - valid loss: 1.49e+02, valid ACC: 1.87e-01
epoch: 5, lr: 8.79e-04, steps: 21975, optimizer: Adam - train loss: 2.27e+02 - valid loss: 1.49e+02, valid ACC: 1.94e-01
epoch: 6, lr: 9.74e-04, steps: 26370, optimizer: Adam - train loss: 2.25e+02 - valid loss: 1.49e+02, valid ACC: 1.96e-01
epoch: 7, lr: 9.01e-04, steps: 30765, optimizer: Adam - train loss: 2.22e+02 - valid loss: 1.49e+02, valid ACC: 2.01e-01
epoch: 8, lr: 8.43e-04, steps: 35160, optimizer: Adam - train loss: 2.18e+02 - valid loss: 1.50e+02, valid ACC: 2.03e-01
epoch: 9, lr: 7.95e-04, steps: 39555, optimizer: Adam - train loss: 2.15e+02 - valid loss: 1.52e+02, valid ACC: 2.03e-01
epoch: 10, lr: 7.54e-04, steps: 43950, optimizer: Adam - train loss: 2.13e+02 - valid loss: 1.54e+02, valid ACC: 2.02e-01, valid WER: 2.58e+02
epoch: 11, lr: 7.19e-04, steps: 48345, optimizer: Adam - train loss: 2.11e+02 - valid loss: 1.54e+02, valid ACC: 2.00e-01
epoch: 12, lr: 6.89e-04, steps: 52740, optimizer: Adam - train loss: 2.09e+02 - valid loss: 1.55e+02, valid ACC: 1.98e-01
epoch: 13, lr: 6.61e-04, steps: 57135, optimizer: Adam - train loss: 2.07e+02 - valid loss: 1.56e+02, valid ACC: 1.97e-01
epoch: 14, lr: 6.37e-04, steps: 61530, optimizer: Adam - train loss: 2.06e+02 - valid loss: 1.56e+02, valid ACC: 1.95e-01
epoch: 15, lr: 6.16e-04, steps: 65925, optimizer: Adam - train loss: 2.04e+02 - valid loss: 1.57e+02, valid ACC: 1.93e-01
epoch: 16, lr: 5.96e-04, steps: 70320, optimizer: Adam - train loss: 2.03e+02 - valid loss: 1.58e+02, valid ACC: 1.91e-01
epoch: 17, lr: 5.78e-04, steps: 74715, optimizer: Adam - train loss: 2.02e+02 - valid loss: 1.58e+02, valid ACC: 1.89e-01
epoch: 18, lr: 5.62e-04, steps: 79110, optimizer: Adam - train loss: 2.01e+02 - valid loss: 1.59e+02, valid ACC: 1.87e-01
epoch: 19, lr: 5.47e-04, steps: 83505, optimizer: Adam - train loss: 2.00e+02 - valid loss: 1.57e+02, valid ACC: 1.87e-01
epoch: 20, lr: 5.33e-04, steps: 87900, optimizer: Adam - train loss: 1.99e+02 - valid loss: 1.59e+02, valid ACC: 1.85e-01, valid WER: 1.97e+02
epoch: 21, lr: 5.20e-04, steps: 92295, optimizer: Adam - train loss: 1.99e+02 - valid loss: 1.58e+02, valid ACC: 1.84e-01
epoch: 22, lr: 5.08e-04, steps: 96690, optimizer: Adam - train loss: 1.98e+02 - valid loss: 1.59e+02, valid ACC: 1.83e-01
which seems to indicate that something is wrong, because the train loss gets stuck. I also tried increasing the global batch size from 64 to 128 by raising the per-GPU batch size to 16, but I get a similar outcome. Is this a problem related to DDP, and therefore an expected outcome?
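For reference, here is a minimal sketch of the batch-size arithmetic I am assuming (global batch = per-GPU batch_size × gradient_accumulation × number of GPUs); the helper function is purely illustrative:

```python
# Minimal sketch of the effective (global) batch-size arithmetic assumed above:
# global batch = per-GPU batch_size * gradient_accumulation * number of GPUs.

def global_batch(batch_size: int, grad_accumulation: int, n_gpus: int) -> int:
    return batch_size * grad_accumulation * n_gpus

print(global_batch(batch_size=16, grad_accumulation=4, n_gpus=1))  # 64  (1 GPU run)
print(global_batch(batch_size=8,  grad_accumulation=1, n_gpus=8))  # 64  (8 GPU DDP run)
print(global_batch(batch_size=16, grad_accumulation=1, n_gpus=8))  # 128 (second 8 GPU attempt)
```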
Thanks for your help.
Top GitHub Comments
We spotted a few errors in the recipe that could cause issues with multi-GPU training. Let us work a bit on that part 😃
Writing this comment just to help close this issue.
I have now tested it with the right number of backward steps per epoch, and it is pretty close to the 1 GPU case. In order to have the same number of steps when using 8 GPUs, I set batch_size: 8 and gradient_accumulation: 1. The results are very close to the 1 GPU numbers I included in my first comment.
Thanks for your help.