wav2vec2.0 unsupervised finetuning.
❓ Questions and Help
What is your question?
Hey there, I have a question regarding the unsupervised fine-tuning of the wav2vec 2.0 models. As expected, the results that the English-pretrained model achieves on other languages are not that groundbreaking out of the box, at least for the small model pretrained on Librispeech. In the README, you provide examples of how to either train a new model completely from scratch or fine-tune a pretrained model with CTC on labeled data. While the latter works really well and achieves satisfying results on English datasets such as Common Voice or Voxforge with a fraction of the labeled data that would normally be required, the results I got on a different language (Polish), with completely different phonetics, are not that good.

So naturally, the first thing I want to try is to adapt the unsupervised model to the new domain so that it “gets used to” the sound of Polish speech. While I could try to train such a model from scratch, the paper mentions that it took 1.6 days on 64 V100 GPUs, so I imagine that in order to get satisfying quality I would need to train for at least a week on the 4 RTX 2080 Ti cards I have available, and that is something I cannot really afford at the moment. That is why I wanted to try fine-tuning the existing model on the target domain, hoping that this way I could improve results on Polish data with a fraction of the training time.
So, my questions are:
- Do you think it is a good idea to fine-tune the unsupervised model in an unsupervised way on a new domain, or would you rather train it completely from scratch?
- Are there any caveats to watch out for? For example, could the LR scheduler mess up the training? I have limited experience training models at this scale.
What have you tried?
I have already launched the procedure by renaming `wav2vec_small.pt` to `checkpoint_last.pt` and pointing `--save-dir` at that directory. However, I had to pass the `--reset-optimizer` flag because, apparently, the criterions did not match (the command you give in the README uses `--criterion wav2vec`, while the loaded checkpoint had `BinaryCrossEntropyCriterion` for some reason).
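For anyone who hits the same mismatch: the criterion a checkpoint was pretrained with is recorded in the training configuration saved inside the checkpoint itself, so a quick check looks roughly like the sketch below. This is only a minimal sketch; the `args`/`cfg` key names are an assumption about how fairseq stores the config and may differ between versions.

```python
# Minimal sketch (not from the fairseq README): peek inside a checkpoint to see
# which criterion it was trained with. Assumes the training config is stored
# under "args" (older fairseq checkpoints) or "cfg" (newer, hydra-based ones);
# on recent PyTorch you may additionally need weights_only=False in torch.load.
import torch

state = torch.load("wav2vec_small.pt", map_location="cpu")

if state.get("args") is not None:      # older checkpoints: argparse Namespace
    print("criterion:", getattr(state["args"], "criterion", None))
elif state.get("cfg") is not None:     # newer checkpoints: hydra/omegaconf config
    print("criterion:", state["cfg"].get("criterion"))
```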
What’s your environment?
- fairseq Version: commit 83c39c41388f2e7ba37647d2e8a0cbc97f6f8032
Top GitHub Comments
@kwasnydam @alexeib Hi, thanks for sharing this. I am trying to fine-tune the wav2vec 2.0 model in an unsupervised way on my own speech dataset, but the resulting accuracy is really bad. I am confused about whether it is even reasonable to fine-tune wav2vec 2.0 in an unsupervised way. In other words, if these models do not work well when fine-tuned unsupervised on a target dataset, should wav2vec 2.0 and other similar pre-trained models only be used as static pre-trained models, or be fine-tuned in a supervised way? Thanks for your help!
`<s>` is interpreted by fairseq as a “beginning of sentence” token, which I have hijacked to use as the CTC blank token. So if you want to decode a CTC output, you collapse consecutive duplicates and then remove blanks. `|` is a word-boundary token (as defined by the lexicon/training data; nothing special about it code-wise except during eval).
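As an illustration of that decoding rule, a minimal greedy CTC decoder might look like the sketch below: take the argmax token per frame, collapse consecutive duplicates, drop the `<s>` blank, and map `|` to a space. Here `emissions` would be the per-frame output of a CTC-fine-tuned model for a single utterance; the `id_to_token` mapping and tensor shape are assumptions for illustration, not fairseq's actual decoder.

```python
import torch

def ctc_greedy_decode(emissions: torch.Tensor, id_to_token: dict, blank: str = "<s>") -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    emissions: (time, vocab) tensor of per-frame logits or log-probabilities.
    id_to_token: maps vocabulary indices to symbols, e.g. {0: "<s>", 4: "|", 5: "a", ...}.
    """
    ids = emissions.argmax(dim=-1).tolist()
    if not ids:
        return ""

    # 1. Collapse consecutive duplicate predictions.
    collapsed = [ids[0]] + [cur for prev, cur in zip(ids, ids[1:]) if cur != prev]

    # 2. Remove the blank token, then map the word-boundary token "|" to a space.
    tokens = [id_to_token[i] for i in collapsed if id_to_token[i] != blank]
    return "".join(" " if t == "|" else t for t in tokens)
```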