Fine-Tuning Wav2Vec2 with PyTorch DDP
Environment info
- `transformers` version: 4.11.0.dev0
- Platform: Linux-5.11.0-1017-aws-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
- Jax version: 0.2.20
- JaxLib version: 0.1.71
- Using GPU in script?: yes (8 or 1)
- Using distributed or parallel set-up in script?: yes
Problem:
I’m running some experiments on fine-tuning a pretrained XLSR-Wav2Vec2 model on the Turkish dataset of Common Voice.
The fine-tuning script is an updated version of the existing `run_common_voice.py` script that can be seen in this PR: https://github.com/huggingface/transformers/blob/97936d3aacc04f6253ff178415b8a57768fc8ce6/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py
It leverages the `Trainer` for CTC training of Wav2Vec2.
I’m running the training script both for distributed training:
```bash
python -m torch.distributed.launch \
--nproc_per_node 8 run_speech_recognition_ctc.py \
--dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
--dataset_config_name="tr" \
--output_dir="./wav2vec2-large-xlsr-turkish-demo-dist" \
--overwrite_output_dir \
--num_train_epochs="30" \
--per_device_train_batch_size="4" \
--learning_rate="3e-4" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--save_steps="400" \
--eval_steps="100" \
--logging_steps="1" \
--save_total_limit="3" \
--fp16 \
--freeze_feature_extractor \
--chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
--do_train --do_eval
```
and single-GPU training:
```bash
CUDA_VISIBLE_DEVICES="0" python run_speech_recognition_ctc.py \
--dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
--dataset_config_name="tr" \
--output_dir="./wav2vec2-large-xlsr-turkish-demo" \
--overwrite_output_dir \
--num_train_epochs="30" \
--per_device_train_batch_size="16" \
--gradient_accumulation_steps="2" \
--learning_rate="3e-4" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--save_steps="400" \
--eval_steps="100" \
--logging_steps="1" \
--save_total_limit="3" \
--freeze_feature_extractor \
--gradient_checkpointing \
--fp16 \
--chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
--do_train --do_eval
```
As you can see, the only difference between the two commands is that distributed training (DDP) does not use gradient checkpointing and has a per-GPU batch size of 4, resulting in an effective batch size of 32, whereas single-GPU training has a per-GPU batch size of 16 and uses gradient accumulation of 2 (and gradient checkpointing). So the training runs are more or less identical in terms of learning rate decay, optimizer, effective batch size, …
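As a quick sanity check on the flags above (values copied straight from the two commands), the effective batch sizes do indeed match:

```python
# Effective batch size = num GPUs * per_device_train_batch_size * gradient_accumulation_steps
ddp_effective = 8 * 4 * 1        # DDP run: 8 GPUs, batch size 4, no accumulation
single_effective = 1 * 16 * 2    # single-GPU run: batch size 16, accumulation 2
assert ddp_effective == single_effective == 32
```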
Now, what is quite surprising to me is that single-GPU training works very well. Here is a report with the most important metrics of the run: https://wandb.ai/patrickvonplaten/huggingface/reports/Wav2Vec2-1-GPU-V100--VmlldzoxMDQwNzI0?accessToken=5xhtxrgy59l7dl2sds08bfk8xq1l30uf1ae0i5lio2r7dpx43vzxufsjmxkkbkig while distributed training doesn’t work at all. Here is a report of that run: https://wandb.ai/patrickvonplaten/huggingface/reports/Wav2Vec2-DistributedDataParallel-DDP-8-GPU-V100--VmlldzoxMDQwMDU3?accessToken=rsxt5n2s31bfg3kmbtvb982zcqlg8hby7mrjniftnx4n87kephus81zeaj92xfbu
While Wav2Vec2’s CTC loss isn’t super stable, the single-GPU script is quite robust to changes in the batch size, learning rate, and random seed (I’ve tried a bunch of slight changes and the script always manages to push the training/eval loss below 1 and yields a reasonable word error rate early on). On the other hand, the distributed script doesn’t seem to work at all (I tried out a variety of dropout rates, learning rates, batch sizes, layerdrop, …) -> none of them converge.
That’s quite surprising to me as the scripts should in theory be more or less the same.
Some possible reasons I thought of:
- In distributed training the gradients are computed for each process/GPU separately and then averaged (reduced). However, this is slightly different from single-GPU training for the following reason: on a single GPU, each input sample in the batch can have a different number of losses, e.g. if the labels are `[["Hello my name is"], ["hey <pad> <pad> <pad>"]]`, then on a single GPU the loss is correctly averaged (5 words = 5 losses = sum(losses) / 5). In DDP, however, the losses are averaged locally and then the gradients are averaged globally, which would mean (1st GPU: 4 words = 4 losses, 2nd GPU: 1 word = 1 loss) => (sum(losses_gpu1) / 4 + sum(losses_gpu2) / 1) / 2, which is not the same as on a single GPU (see the numeric sketch after this list). However, I’ve also played around with `group_by_length` - in this case the inputs per batch should be roughly similar and there shouldn’t be a problem, but that didn’t help either. I’ve also summed the losses and scaled the gradients correctly - see: https://github.com/huggingface/transformers/blob/97936d3aacc04f6253ff178415b8a57768fc8ce6/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L286, which also didn’t help. Related issue/discussion on PyTorch: https://discuss.pytorch.org/t/average-loss-in-dp-and-ddp/93306
- Could gradient checkpointing be the reason? I really don’t see how this could make a difference though…
- Could gradient accumulation be the reason? I also tried using gradient accumulation of 2 and a batch size of 2 per GPU in the DDP setup, which didn’t help either.
- Another specialty of Wav2Vec2 is that the first 7 conv layers are frozen here: https://github.com/huggingface/transformers/blob/ea92136597c49a20c5e2c31ef20ccec1693a8858/examples/research_projects/wav2vec2/run_common_voice.py#L454, which calls this function: https://github.com/huggingface/transformers/blob/ea92136597c49a20c5e2c31ef20ccec1693a8858/src/transformers/models/wav2vec2/modeling_wav2vec2.py#L1416. Could it be that in DDP just one of the eight model replicas does that, but not the rest? (Actually I could easily check this - see the second sketch after this list…)
- Could it be that fp16 AMP loss scaling somehow behaves differently in DDP than in single-GPU training?
- … other possible reasons?
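To make the first point concrete, here is a small numeric sketch (illustrative toy values, not taken from an actual run) of how per-rank mean reduction plus DDP gradient averaging differs from a single global mean over all label tokens:

```python
import torch

# Toy numbers for the padding example above: 4 word-level losses on GPU 1, 1 on GPU 2.
losses_gpu1 = torch.tensor([2.0, 1.0, 3.0, 2.0])
losses_gpu2 = torch.tensor([8.0])

# Single GPU: all 5 losses end up in one batch -> one global mean.
single_gpu_loss = torch.cat([losses_gpu1, losses_gpu2]).mean()   # (2 + 1 + 3 + 2 + 8) / 5 = 3.2

# DDP with a per-rank "mean": each rank averages locally, and the all-reduced
# gradient then corresponds to the mean of the local means.
ddp_loss = (losses_gpu1.mean() + losses_gpu2.mean()) / 2         # (2.0 + 8.0) / 2 = 5.0

print(single_gpu_loss.item(), ddp_loss.item())  # 3.2 vs 5.0 -> the short sample is over-weighted in DDP
```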
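And for the point about the frozen feature extractor, a minimal sketch of how one could check this on every rank. The helper itself is hypothetical; the attribute path `model.wav2vec2.feature_extractor` is the one `freeze_feature_extractor()` operates on in the transformers version used here:

```python
import torch.distributed as dist

def report_frozen_feature_extractor(model):
    """Print, per rank, how many conv feature-extractor parameters are frozen."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    params = list(model.wav2vec2.feature_extractor.parameters())
    frozen = sum(1 for p in params if not p.requires_grad)
    print(f"rank {rank}: {frozen}/{len(params)} feature-extractor parameters frozen")
```

Calling this right after `model.freeze_feature_extractor()` in the script would immediately show whether any rank skipped the freeze.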
@stas00 @sgugger - Have you heard about this kind of problem before (single GPU works but DDP doesn’t)? I think it’s very hard to debug or dive into this problem, but maybe you have some useful next-step debugging strategies or tips!
@anton-l - Have you used DDP training during the Wav2Vec2 sprint? I’ve pretty much only used single-GPU training, which works well, but not DDP… have you had similar problems before?
I’ve also tried running DDP on other/bigger datasets without success, so I’m a bit confused why it doesn’t work here. I think the CTC loss (https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html) is quite special and definitely more prone to instabilities than a simple cross-entropy loss, but it’s still very surprising to me that single-GPU training works rather easily while DDP doesn’t work at all.
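For reference, a minimal `torch.nn.CTCLoss` sketch (random toy tensors; vocab size and lengths are made up) showing how the `"mean"` and `"sum"` reductions relate, which is exactly where the DDP averaging question comes in:

```python
import torch

# Toy tensors: 50 time steps, batch of 2, vocab of 32 tokens, blank id 0.
T, N, C = 50, 2, 32
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # labels must not contain the blank id
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([10, 3], dtype=torch.long)   # unequal label lengths, as in the padding example above

loss_mean = torch.nn.CTCLoss(blank=0, reduction="mean")(log_probs, targets, input_lengths, target_lengths)
loss_sum = torch.nn.CTCLoss(blank=0, reduction="sum")(log_probs, targets, input_lengths, target_lengths)
# "mean" divides each sample's loss by its target length and then averages over the batch;
# "sum" just adds everything up, so the per-rank scaling is what DDP gradient averaging is sensitive to.
```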
Some things I was planning on trying out next:
- Running DDP with just 2 GPUs to see whether the loss becomes more unstable the more GPUs are used…
Do you maybe have any other good debugging strategies?
Issue is solved. Will post a more detailed explanation of the reason.
Yeah, in the training runs above I actually disabled this (I just use `Trainer` instead of `CTCTrainer`). I copied that code more or less from fairseq’s Trainer. The idea is to only use `ctc_loss_reduction="mean"` in the single-GPU setup, but to use `ctc_loss_reduction="sum"` in the DDP setup, sum all losses, and later scale the gradients correctly. With this code, every local copy of the model gets the gradient d(sum(loss_1))/d(params), so that the average reduced gradient is d(sum(loss_1) + sum(loss_2) + …)/d(params) / 8, with 8 being the world size. I then multiply by 8 and divide by the total number of losses (batch_size * seq_length of gpu_1 + batch_size * seq_length of gpu_2 + …), which should make the gradients identical to `ctc_loss_reduction="mean"` in the single-GPU setup. I tried this out a couple of times, but it didn’t solve the problem, and given that the sequence lengths in Common Voice are quite similar, just using `ctc_loss_reduction="mean"` should work fine as well (see the first possible reason for a bug above).

=> So in short, I currently don’t use that code, as I just use `Trainer` instead of `CTCTrainer` (sorry, I should probably have commented out `CTCTrainer` completely - see: https://github.com/huggingface/transformers/blob/97936d3aacc04f6253ff178415b8a57768fc8ce6/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L511).
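For completeness, a minimal sketch of the scaling idea described above. It assumes `ctc_loss_reduction="sum"`, a hypothetical per-rank tensor `num_local_losses` holding the number of loss terms on that rank, and that it runs after `loss.backward()` but before `optimizer.step()`; it illustrates the math, it is not the actual `CTCTrainer` code:

```python
import torch
import torch.distributed as dist

def rescale_summed_ctc_gradients(model, num_local_losses: torch.Tensor):
    """Rescale gradients of a summed CTC loss so they match single-GPU reduction="mean".

    With reduction="sum", DDP's all-reduce gives grads = d(sum_1 + ... + sum_W)/d(params) / W.
    Multiplying by the world size W and dividing by the global number of losses
    recovers the gradient of the global mean loss.
    """
    world_size = dist.get_world_size()
    total_losses = num_local_losses.clone()
    dist.all_reduce(total_losses, op=dist.ReduceOp.SUM)  # total number of losses over all ranks
    scale = world_size / total_losses.item()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(scale)
```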