wav2vec2.0 unsupervised finetuning.

See original GitHub issue

❓ Questions and Help

What is your question?

Hey there, I have a question regarding unsupervised fine-tuning of the wav2vec2.0 models. As expected, the results that the English-pretrained model achieves on other languages are not that groundbreaking out of the box, at least for the small model pretrained on LibriSpeech. In the README you provide examples of how to either train a new model completely from scratch or fine-tune a pretrained model with CTC on labeled data. While the latter works really well and achieves satisfying results on English datasets such as Common Voice or VoxForge with a fraction of the labeled data that would normally be required, the results I got on a different language (Polish), with completely different phonetics, are not that good.

So naturally, the first thing I want to try is to adapt the domain of the unsupervised model so that it “gets used to” the sound of Polish speech. While I could try to train such a model from scratch, in the paper you mention that pretraining took 1.6 days on 64 V100 GPUs, so I imagine that to get satisfying quality I would need to train for at least a week on the 4 RTX 2080 Ti cards I have available, and that is something I cannot really afford at the moment. That is why I wanted to try fine-tuning the existing model on the target domain, hoping that this way I could improve results on Polish data with a fraction of the training time.

Soooo my questions are:

  1. Do you think it is a good idea to finetune the unsupervised model in an unsupervised way on a new domain or would you rather train it completely from scratch?
  2. Are there any caveats to watch out for? For example, could the LR scheduler mess up the training? I have limited experience training models at this scale.

What have you tried?

I have already launched the procedure by renaming the wav2vec_small checkpoint to checkpoint_last.pt and starting from that directory as the --save-dir. However, I had to pass the --reset-optimizer flag because, apparently, the criterions did not match (the command you have in the README uses --criterion wav2vec, but the loaded checkpoint had BinaryCrossEntropyCriterion for some reason).
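
For reference, the criterion the checkpoint was trained with can be inspected directly from the checkpoint file. Something along these lines worked for me (just a sketch, assuming a standard torch-serialized fairseq checkpoint; exact key names may vary between fairseq versions):

```python
# Minimal sketch (assumption: a fairseq checkpoint is a torch-serialized dict;
# key names may differ between fairseq versions).
import torch

state = torch.load("checkpoint_last.pt", map_location="cpu")

# Arguments/config the checkpoint was pretrained with.
print(state.get("args") or state.get("cfg"))

# The optimizer history usually records the criterion used during pretraining;
# a mismatch with the new --criterion is what forces --reset-optimizer.
for entry in state.get("optimizer_history", []):
    print(entry.get("criterion_name"), entry.get("optimizer_name"))
```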

What’s your environment?

  • fairseq Version: commit 83c39c41388f2e7ba37647d2e8a0cbc97f6f8032

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 15 (6 by maintainers)

Top GitHub Comments

2 reactions
xiewenjing1170 commented on Jun 21, 2021

@kwasnydam @alexeib Hi, thanks for sharing. I am trying to fine-tune a wav2vec2.0 model in an unsupervised way on my own speech dataset, but the accuracy is really bad. I am confused about whether it is even reasonable to fine-tune wav2vec2.0 in an unsupervised way. In other words, if these models cannot work well when fine-tuned in an unsupervised way on a target dataset, should wav2vec2.0 and similar pre-trained models only be used as static pre-trained models, or be fine-tuned in a supervised way? Thanks for your help!

1 reaction
alexeib commented on Nov 19, 2020

<s> is interpreted by fairseq as a “beginning of sentence” token, which I have hijacked to use as a CTC blank token. So if you want to decode a CTC output, you collapse consecutive duplicates and then remove blanks. | is a word-boundary token (as defined by the lexicon/training data; nothing special about it code-wise except during eval).
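
In pseudo-code, the post-processing looks something like this (just a sketch of the collapse/remove rule above, not the actual fairseq decoder; it assumes you already have one argmax token per frame):

```python
# Minimal sketch of greedy CTC post-processing as described above
# (assumption: frame-level argmax tokens are already available).
def ctc_greedy_decode(frame_tokens, blank="<s>", word_sep="|"):
    # 1) collapse runs of consecutive duplicate tokens
    collapsed = []
    for tok in frame_tokens:
        if not collapsed or tok != collapsed[-1]:
            collapsed.append(tok)
    # 2) remove blank tokens
    letters = [tok for tok in collapsed if tok != blank]
    # 3) join letters and turn word-boundary tokens into spaces
    return "".join(letters).replace(word_sep, " ").strip()

# Example: frame-level predictions decoding to "cat dog"
frames = ["<s>", "c", "c", "a", "t", "t", "|", "d", "o", "o", "g", "<s>"]
print(ctc_greedy_decode(frames))  # -> "cat dog"
```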
