
Fine-Tuning Wav2Vec2 with PyTorch DDP


Environment info

  • transformers version: 4.11.0.dev0
  • Platform: Linux-5.11.0-1017-aws-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.9.0+cu111 (True)
  • Tensorflow version (GPU?): 2.6.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.3.4 (cpu)
  • Jax version: 0.2.20
  • JaxLib version: 0.1.71
  • Using GPU in script?: yes (8 or 1)
  • Using distributed or parallel set-up in script?: yes

Who can help

@sgugger @stas00 @anton-l

Problem:

I’m running some experiments on fine-tuning a pretrained XLSR-Wav2Vec2 model on the Turkish dataset of Common Voice.

The fine-tuning script is an updated version of the existing run_common_voice.py script and can be seen here: https://github.com/huggingface/transformers/blob/97936d3aacc04f6253ff178415b8a57768fc8ce6/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py

It leverages the Trainer for CTC training of Wav2Vec2.
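
For context, the core of that setup boils down to roughly the following sketch (simplified and assumed, not the exact script; processor, data_collator, train_dataset, and eval_dataset stand in for the omitted vocabulary-building and preprocessing code):

# Rough sketch of the Trainer-based CTC setup (illustrative only; `processor`,
# `data_collator`, `train_dataset`, and `eval_dataset` are assumed to be built
# elsewhere in the script).
from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    ctc_loss_reduction="mean",                       # relevant to the "mean" vs "sum" discussion below
    pad_token_id=processor.tokenizer.pad_token_id,   # tokenizer built from the dataset vocab
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_extractor()                     # corresponds to --freeze_feature_extractor

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-turkish-demo",
    per_device_train_batch_size=4,
    num_train_epochs=30,
    learning_rate=3e-4,
    warmup_steps=500,
    evaluation_strategy="steps",
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,                     # CTC padding collator defined in the script
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)
trainer.train()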

I’m running the training script both with distributed training (as follows):

python -m torch.distributed.launch \
        --nproc_per_node 8 run_speech_recognition_ctc.py \
        --dataset_name="common_voice" \
        --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
        --dataset_config_name="tr" \
        --output_dir="./wav2vec2-large-xlsr-turkish-demo-dist" \
        --overwrite_output_dir \
        --num_train_epochs="30" \
        --per_device_train_batch_size="4" \
        --learning_rate="3e-4" \
        --warmup_steps="500" \
        --evaluation_strategy="steps" \
        --save_steps="400" \
        --eval_steps="100" \
        --logging_steps="1" \
        --save_total_limit="3" \
        --fp16 \
        --freeze_feature_extractor \
        --chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
        --do_train --do_eval

and with single-GPU training:

CUDA_VISIBLE_DEVICES="0" python run_speech_recognition_ctc.py \
        --dataset_name="common_voice" \
        --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
        --dataset_config_name="tr" \
        --output_dir="./wav2vec2-large-xlsr-turkish-demo" \
        --overwrite_output_dir \
        --num_train_epochs="30" \
        --per_device_train_batch_size="16" \
        --gradient_accumulation_steps="2" \
        --learning_rate="3e-4" \
        --warmup_steps="500" \
        --evaluation_strategy="steps" \
        --save_steps="400" \
        --eval_steps="100" \
        --logging_steps="1" \
        --save_total_limit="3" \
        --freeze_feature_extractor \
        --gradient_checkpointing \
        --fp16 \
        --chars_to_ignore , ? . ! - \; \: \" “ % ‘ ” � \
        --do_train --do_eval

As you can see, the only differences between the two commands are that distributed training (DDP) does not use gradient checkpointing and uses a per-GPU batch size of 4 (an effective batch size of 32 across 8 GPUs), whereas single-GPU training uses a per-GPU batch size of 16 with gradient accumulation of 2 (also an effective batch size of 32), plus gradient checkpointing. So the two runs are more or less identical in terms of learning rate decay, optimizer, effective batch size, …
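
Spelled out, the effective batch sizes implied by the two commands work out the same:

# effective batch size = per-device batch size * number of GPUs * gradient accumulation steps
ddp_effective_bs    = 4  * 8 * 1   # DDP command above (no gradient accumulation flag, so 1)
single_effective_bs = 16 * 1 * 2   # single-GPU command above
assert ddp_effective_bs == single_effective_bs == 32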

Now what is quite surprising to me is that single-GPU training works very well, while distributed training doesn’t work at all. Here is a report with the most important metrics of the single-GPU run: https://wandb.ai/patrickvonplaten/huggingface/reports/Wav2Vec2-1-GPU-V100--VmlldzoxMDQwNzI0?accessToken=5xhtxrgy59l7dl2sds08bfk8xq1l30uf1ae0i5lio2r7dpx43vzxufsjmxkkbkig and here is a report of the DDP run: https://wandb.ai/patrickvonplaten/huggingface/reports/Wav2Vec2-DistributedDataParallel-DDP-8-GPU-V100--VmlldzoxMDQwMDU3?accessToken=rsxt5n2s31bfg3kmbtvb982zcqlg8hby7mrjniftnx4n87kephus81zeaj92xfbu

While Wav2Vec2’s CTC loss isn’t super stable, the single-GPU script is quite robust to changes in batch size, learning rate, and random seed (I’ve tried a bunch of slight changes and the script always manages to push the training/eval loss below 1 and yields a reasonable word error rate early on). On the other hand, the distributed script doesn’t seem to work at all (I tried a variety of dropout rates, learning rates, batch sizes, layerdrop values, …) -> none of them converge.

That’s quite surprising to me as the scripts should in theory be more or less the same.

Some possible reasons I thought of:

@stas00 @sgugger - Have you heard about this kind of problem before (single GPU works but DDP doesn’t)? I think it’s very hard to debug or dive into this problem, but maybe you have some useful next-step debugging strategies or tips!

@anton-l - have you used DDP training during the Wav2Vec2 sprint? I’ve pretty much only used single-GPU training, which works well, but not DDP… have you had similar problems before?

I’ve also tried running DDP on other/bigger datasets without success, so I’m a bit confused about why it doesn’t work here. I think CTC loss (https://pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html) is quite special and definitely more prone to instabilities than a simple cross-entropy loss, but it’s still very surprising to me that single-GPU training works rather easily while DDP doesn’t work at all.
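
For reference, here is a minimal illustration of the torch.nn.CTCLoss interface with made-up shapes (not taken from the training script): the loss operates on whole (log-prob sequence, target sequence) pairs and returns inf for samples whose targets cannot be aligned to the input, which is one source of the instability mentioned above.

import torch

# Shapes follow the torch.nn.CTCLoss docs: (T, N, C) log-probs, (N, S) targets.
T, N, C, S = 50, 4, 32, 10                                  # input length, batch, vocab (incl. blank), max target length
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S), dtype=torch.long)     # index 0 is reserved for the blank token
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(5, S + 1, (N,), dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0, reduction="mean", zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss)   # a single scalar; with zero_infinity=False, unalignable samples yield inf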

Some things I was planning on trying out next:

  • Running DDP with just 2 GPUs to see whether the loss becomes more unstable as more GPUs are used…

Do you maybe have any other good debugging strategies?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
patrickvonplaten commented, Sep 22, 2021

Issue is solved. I will post a more detailed explanation of the reason.

1 reaction
patrickvonplaten commented, Sep 21, 2021

Yeah, in the training runs above I actually disabled this (I just use Trainer instead of CTCTrainer). I copied that code more or less from fairseq’s Trainer. The idea is to use ctc_loss_reduction="mean" only in the single-GPU setup, but ctc_loss_reduction="sum" in the DDP setup, summing all losses and later scaling the gradients correctly.

With this code, each local copy of the model gets a gradient d(sum(loss_i))/d(params), so the all-reduced (averaged) gradient is (d(sum(loss_1))/d(params) + d(sum(loss_2))/d(params) + …) / 8, with 8 being the world_size. Then I multiply by 8 and divide by the total number of loss terms (batch_size * seq_length of GPU 1 + batch_size * seq_length of GPU 2 + …), which should make the gradients identical to ctc_loss_reduction="mean" in the single-GPU setup.
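
For illustration, here is a rough sketch of what that rescaling could look like (this is not the actual CTCTrainer code; num_items, the number of loss terms summed on this rank, is assumed to be computed by the caller, and the model is assumed to already be wrapped in DDP):

import torch
import torch.distributed as dist

def rescale_sum_reduced_grads(model, loss, num_items):
    # `loss` was computed with ctc_loss_reduction="sum" on this rank.
    loss.backward()                       # DDP all-reduces and *averages* the gradients

    world_size = dist.get_world_size()

    # Total number of loss terms across all ranks.
    total_items = torch.tensor(float(num_items), device=loss.device)
    dist.all_reduce(total_items, op=dist.ReduceOp.SUM)

    # After backward, each grad equals (1/world_size) * sum_i d(sum_loss_i)/d(param).
    # Undo the averaging, then divide by the global term count to recover the
    # gradient of the "mean"-reduced loss.
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_(world_size / total_items)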

I tried this out a couple of times, but it didn’t solve the problem, and given that the sequence lengths in Common Voice are quite similar, just using ctc_loss_reduction="mean" should work fine as well (see the first possible reason for a bug above).

=> So in short, I currently don’t use that code, as I just use Trainer instead of CTCTrainer (sorry, I should probably have commented out CTCTrainer completely - see https://github.com/huggingface/transformers/blob/97936d3aacc04f6253ff178415b8a57768fc8ce6/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L511).

