Fine-Tuning Wav2Vec2 with PyTorch DDP
Environment info
- `transformers` version: 4.11.0.dev0
- Platform: Linux-5.11.0-1017-aws-x86_64-with-glibc2.29
- Python version: 3.8.10
- PyTorch version (GPU?): 1.9.0+cu111 (True)
- Tensorflow version (GPU?): 2.6.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
- Jax version: 0.2.20
- JaxLib version: 0.1.71
- Using GPU in script?: yes (8 or 1)
- Using distributed or parallel set-up in script?: yes
Problem:
I’m running some experiments on fine-tuning a pretrained XLSR-Wav2Vec2 model on the Turkish dataset of Common Voice.
The fine-tuning script is an updated version of the existing `run_common_voice.py` script that can be seen in this PR: https://github.com/huggingface/transformers/blob/97936d3aacc04f6253ff178415b8a57768fc8ce6/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py
It leverages the `Trainer` for CTC training of Wav2Vec2.
I’m running the training script both for distributed training:
```bash
python -m torch.distributed.launch \
--nproc_per_node 8 run_speech_recognition_ctc.py \
--dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
--dataset_config_name="tr" \
--output_dir="./wav2vec2-large-xlsr-turkish-demo-dist" \
--overwrite_output_dir \
--num_train_epochs="30" \
--per_device_train_batch_size="4" \
--learning_rate="3e-4" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--save_steps="400" \
--eval_steps="100" \
--logging_steps="1" \
--save_total_limit="3" \
--fp16 \
--freeze_feature_extractor \
--chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
--do_train --do_eval
```
and single-GPU training:
```bash
CUDA_VISIBLE_DEVICES="0" python run_speech_recognition_ctc.py \
--dataset_name="common_voice" \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
--dataset_config_name="tr" \
--output_dir="./wav2vec2-large-xlsr-turkish-demo" \
--overwrite_output_dir \
--num_train_epochs="30" \
--per_device_train_batch_size="16" \
--gradient_accumulation_steps="2" \
--learning_rate="3e-4" \
--warmup_steps="500" \
--evaluation_strategy="steps" \
--save_steps="400" \
--eval_steps="100" \
--logging_steps="1" \
--save_total_limit="3" \
--freeze_feature_extractor \
--gradient_checkpointing \
--fp16 \
--chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
--do_train --do_eval
```
As you can see, the only difference between the two commands is that distributed training (DDP) does not use gradient checkpointing and has a per-GPU batch size of 4, resulting in an effective batch size of 32, whereas single-GPU training has a per-GPU batch size of 16 and uses gradient accumulation of 2 (and gradient checkpointing). So the training runs are more or less identical in terms of learning rate decay, optimizer, effective batch size, …
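As a quick sanity check on the flags above (values copied straight from the two commands), the effective batch sizes do indeed match:

```python
# Effective batch size = num GPUs * per_device_train_batch_size * gradient_accumulation_steps
ddp_effective = 8 * 4 * 1        # DDP run: 8 GPUs, batch size 4, no accumulation
single_effective = 1 * 16 * 2    # single-GPU run: batch size 16, accumulation 2
assert ddp_effective == single_effective == 32
```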
Now, what is quite surprising to me is that single-GPU training works very well. Here is a report with the most important metrics of the run: https://wandb.ai/patrickvonplaten/huggingface/reports/Wav2Vec2-1-GPU-V100--VmlldzoxMDQwNzI0?accessToken=5xhtxrgy59l7dl2sds08bfk8xq1l30uf1ae0i5lio2r7dpx43vzxufsjmxkkbkig while distributed training doesn’t work at all. Here is a report of that run: https://wandb.ai/patrickvonplaten/huggingface/reports/Wav2Vec2-DistributedDataParallel-DDP-8-GPU-V100--VmlldzoxMDQwMDU3?accessToken=rsxt5n2s31bfg3kmbtvb982zcqlg8hby7mrjniftnx4n87kephus81zeaj92xfbu
While Wav2Vec2’s CTC loss isn’t super stable, the single-GPU script is quite robust to changes in the batch size, learning rate, and random seed (I’ve tried a bunch of slight changes and the script always manages to push the training/eval loss below 1 and yields a reasonable word error rate early on). On the other hand, the distributed script doesn’t seem to work at all (I tried out a variety of dropout rates, learning rates, batch sizes, layerdrop, …) -> none of them converge.
That’s quite surprising to me as the scripts should in theory be more or less the same.
Some possible reasons I thought of:
- In distributed training the gradients are computed for each process/GPU separately and then averaged (reduced). However, this is slightly different from single-GPU training for the following reason: on a single GPU, each input sample in the batch can have a different number of losses, e.g. if the labels are `[["Hello my name is"], ["hey <pad> <pad> <pad>"]]`, then on a single GPU the loss is correctly averaged (5 words = 5 losses = sum(losses) / 5). In DDP, however, the losses are averaged locally and then the gradients are averaged globally, which would mean (1st GPU: 4 words = 4 losses, 2nd GPU: 1 word = 1 loss) => (sum(losses_gpu1) / 4 + sum(losses_gpu2) / 1) / 2, which is not the same as on a single GPU (see the numeric sketch after this list). However, I’ve also played around with `group_by_length` - in this case the inputs per batch should be roughly similar and there shouldn’t be a problem, but that didn’t help either. I’ve also summed the losses and scaled the gradients correctly - see: https://github.com/huggingface/transformers/blob/97936d3aacc04f6253ff178415b8a57768fc8ce6/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L286, which also didn’t help. Related issue/discussion on PyTorch: https://discuss.pytorch.org/t/average-loss-in-dp-and-ddp/93306
- Could gradient checkpointing be the reason? I really don’t see how this could make a difference though…
- Could gradient accumulation be the reason? I also tried using gradient accumulation of 2 and a batch size of 2 per GPU in the DDP setup, which didn’t help either.
- Another specialty of Wav2Vec2 is that the first 7 conv layers are frozen here: https://github.com/huggingface/transformers/blob/ea92136597c49a20c5e2c31ef20ccec1693a8858/examples/research_projects/wav2vec2/run_common_voice.py#L454, which calls this function: https://github.com/huggingface/transformers/blob/ea92136597c49a20c5e2c31ef20ccec1693a8858/src/transformers/models/wav2vec2/modeling_wav2vec2.py#L1416. Could it be that in DDP just one of the eight model replicas does that, but not the rest? (Actually I could easily check this - see the second sketch after this list…)
- Could it be that fp16 AMP loss scaling somehow behaves differently in DDP than in single-GPU training?
- … other possible reasons?
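To make the first point concrete, here is a small numeric sketch (illustrative toy values, not taken from an actual run) of how per-rank mean reduction plus DDP gradient averaging differs from a single global mean over all label tokens:

```python
import torch

# Toy numbers for the padding example above: 4 word-level losses on GPU 1, 1 on GPU 2.
losses_gpu1 = torch.tensor([2.0, 1.0, 3.0, 2.0])
losses_gpu2 = torch.tensor([8.0])

# Single GPU: all 5 losses end up in one batch -> one global mean.
single_gpu_loss = torch.cat([losses_gpu1, losses_gpu2]).mean()   # (2 + 1 + 3 + 2 + 8) / 5 = 3.2

# DDP with a per-rank "mean": each rank averages locally, and the all-reduced
# gradient then corresponds to the mean of the local means.
ddp_loss = (losses_gpu1.mean() + losses_gpu2.mean()) / 2         # (2.0 + 8.0) / 2 = 5.0

print(single_gpu_loss.item(), ddp_loss.item())  # 3.2 vs 5.0 -> the short sample is over-weighted in DDP
```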
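And for the point about the frozen feature extractor, a minimal sketch of how one could check this on every rank. The helper itself is hypothetical; the attribute path `model.wav2vec2.feature_extractor` is the one `freeze_feature_extractor()` operates on in the transformers version used here:

```python
import torch.distributed as dist

def report_frozen_feature_extractor(model):
    """Print, per rank, how many conv feature-extractor parameters are frozen."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    params = list(model.wav2vec2.feature_extractor.parameters())
    frozen = sum(1 for p in params if not p.requires_grad)
    print(f"rank {rank}: {frozen}/{len(params)} feature-extractor parameters frozen")
```

Calling this right after `model.freeze_feature_extractor()` in the script would immediately show whether any rank skipped the freeze.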
@stas00 @sgugger - Have you heard about this kind of problem before (single GPU works but DDP doesn’t)? I think it’s very hard to debug or dive into this problem, but maybe you have some useful next-step debugging strategies or tips!
@anton-l - Have you used DDP training during the Wav2Vec2 sprint? I’ve pretty much only used single-GPU training, which works well, but not DDP… have you had similar problems before?
I’ve also tried running DDP on other/bigger datasets without success, so I’m a bit confused why it doesn’t work here. I think the CTC loss (https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html) is quite special and definitely more prone to instabilities than a simple cross-entropy loss, but it’s still very surprising to me that single-GPU training works rather easily while DDP doesn’t work at all.
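For reference, a minimal `torch.nn.CTCLoss` sketch (random toy tensors; vocab size and lengths are made up) showing how the `"mean"` and `"sum"` reductions relate, which is exactly where the DDP averaging question comes in:

```python
import torch

# Toy tensors: 50 time steps, batch of 2, vocab of 32 tokens, blank id 0.
T, N, C = 50, 2, 32
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # labels must not contain the blank id
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([10, 3], dtype=torch.long)   # unequal label lengths, as in the padding example above

loss_mean = torch.nn.CTCLoss(blank=0, reduction="mean")(log_probs, targets, input_lengths, target_lengths)
loss_sum = torch.nn.CTCLoss(blank=0, reduction="sum")(log_probs, targets, input_lengths, target_lengths)
# "mean" divides each sample's loss by its target length and then averages over the batch;
# "sum" just adds everything up, so the per-rank scaling is what DDP gradient averaging is sensitive to.
```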
Some things I was planning on trying out next:
- Running DDP with just 2 GPUs to see whether the loss becomes more unstable the more GPUs are used…
Do you maybe have any other good debugging strategies?
Issue is solved. Will post a more detailed explanation of the reason.
Yeah, in the training runs above I actually disabled this (I just use `Trainer` instead of `CTCTrainer`). I copied that code more or less from fairseq’s Trainer. The idea is to only use `ctc_loss_reduction="mean"` in the single-GPU setup, but to use `ctc_loss_reduction="sum"` in the DDP setup, sum all losses, and later scale the gradients correctly. With this code, every local copy of the model gets the gradient d(sum(loss_1))/d(params), so that the average reduced gradient is d(sum(loss_1) + sum(loss_2) + …)/d(params) / 8, with 8 being the world size. I then multiply by 8 and divide by the total number of losses (batch_size * seq_length of gpu_1 + batch_size * seq_length of gpu_2 + …), which should make the gradients identical to `ctc_loss_reduction="mean"` in the single-GPU setup. I tried this out a couple of times, but it didn’t solve the problem, and given that the sequence lengths in Common Voice are quite similar, just using `ctc_loss_reduction="mean"` should work fine as well (see the first possible reason for a bug above).

=> So in short, I currently don’t use that code, as I just use `Trainer` instead of `CTCTrainer` (sorry, I should probably have commented out `CTCTrainer` completely - see: https://github.com/huggingface/transformers/blob/97936d3aacc04f6253ff178415b8a57768fc8ce6/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L511).
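For completeness, a minimal sketch of the scaling idea described above. It assumes `ctc_loss_reduction="sum"`, a hypothetical per-rank tensor `num_local_losses` holding the number of loss terms on that rank, and that it runs after `loss.backward()` but before `optimizer.step()`; it illustrates the math, it is not the actual `CTCTrainer` code:

```python
import torch
import torch.distributed as dist

def rescale_summed_ctc_gradients(model, num_local_losses: torch.Tensor):
    """Rescale gradients of a summed CTC loss so they match single-GPU reduction="mean".

    With reduction="sum", DDP's all-reduce gives grads = d(sum_1 + ... + sum_W)/d(params) / W.
    Multiplying by the world size W and dividing by the global number of losses
    recovers the gradient of the global mean loss.
    """
    world_size = dist.get_world_size()
    total_losses = num_local_losses.clone()
    dist.all_reduce(total_losses, op=dist.ReduceOp.SUM)  # total number of losses over all ranks
    scale = world_size / total_losses.item()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(scale)
```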