wav2vec2.0 unsupervised finetuning.
❓ Questions and Help
What is your question?
Hey there, I have a question regarding the unsupervised fine-tuning of the wav2vec 2.0 models. As expected, the results that the English-pretrained model achieves on other languages are not that groundbreaking out of the box, at least for the small model pretrained on Librispeech. In the README, you provide examples of how to either train a new model completely from scratch or fine-tune a pretrained model with CTC on labeled data. While the latter works really well and achieves satisfying results on English datasets such as Common Voice or Voxforge with a fraction of the labeled data that would normally be required, the results I got on a different language (Polish), with completely different phonetics, are not that good.

So naturally, the first thing I want to try is to adapt the unsupervised model to the new domain so that it “gets used to” the sound of Polish speech. While I could try to train such a model from scratch, the paper mentions that it took 1.6 days on 64 V100 GPUs, so I imagine that in order to get satisfying quality I would need to train for at least a week on the 4 RTX 2080 Ti cards I have available, and that is something I cannot really afford at the moment. That is why I wanted to try fine-tuning the existing model on the target domain, hoping that this way I could improve results on Polish data with a fraction of the training time.
So, my questions are:
- Do you think it is a good idea to fine-tune the unsupervised model in an unsupervised way on a new domain, or would you rather train it completely from scratch?
- Are there any caveats to watch out for? For example, could the LR scheduler mess up the training? I have limited experience training models at this scale.
What have you tried?
I have already launched the procedure by renaming `wav2vec_small.pt` to `checkpoint_last.pt` and pointing `--save-dir` at that directory. However, I had to pass the `--reset-optimizer` flag because, apparently, the criterions did not match (the command you give in the README uses `--criterion wav2vec`, while the loaded checkpoint had `BinaryCrossEntropyCriterion` for some reason).
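For anyone who hits the same mismatch: the criterion a checkpoint was pretrained with is recorded in the training configuration saved inside the checkpoint itself, so a quick check looks roughly like the sketch below. This is only a minimal sketch; the `args`/`cfg` key names are an assumption about how fairseq stores the config and may differ between versions.

```python
# Minimal sketch (not from the fairseq README): peek inside a checkpoint to see
# which criterion it was trained with. Assumes the training config is stored
# under "args" (older fairseq checkpoints) or "cfg" (newer, hydra-based ones);
# on recent PyTorch you may additionally need weights_only=False in torch.load.
import torch

state = torch.load("wav2vec_small.pt", map_location="cpu")

if state.get("args") is not None:      # older checkpoints: argparse Namespace
    print("criterion:", getattr(state["args"], "criterion", None))
elif state.get("cfg") is not None:     # newer checkpoints: hydra/omegaconf config
    print("criterion:", state["cfg"].get("criterion"))
```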
What’s your environment?
- fairseq Version: commit 83c39c41388f2e7ba37647d2e8a0cbc97f6f8032
Top GitHub Comments
@kwasnydam @alexeib Hi, thanks for sharing this. I am trying to fine-tune the wav2vec 2.0 model in an unsupervised way on my own speech dataset, but the resulting accuracy is really bad. I am confused about whether it is even reasonable to fine-tune wav2vec 2.0 in an unsupervised way. In other words, if these models do not work well when fine-tuned unsupervised on a target dataset, should wav2vec 2.0 and other similar pre-trained models only be used as static pre-trained models, or be fine-tuned in a supervised way? Thanks for your help!
`<s>` is interpreted by fairseq as a “beginning of sentence” token, which I have hijacked to use as the CTC blank token. So if you want to decode a CTC output, you collapse consecutive duplicates and then remove blanks. `|` is a word-boundary token (as defined by the lexicon/training data; nothing special about it code-wise except during eval).
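As an illustration of that decoding rule, a minimal greedy CTC decoder might look like the sketch below: take the argmax token per frame, collapse consecutive duplicates, drop the `<s>` blank, and map `|` to a space. Here `emissions` would be the per-frame output of a CTC-fine-tuned model for a single utterance; the `id_to_token` mapping and tensor shape are assumptions for illustration, not fairseq's actual decoder.

```python
import torch

def ctc_greedy_decode(emissions: torch.Tensor, id_to_token: dict, blank: str = "<s>") -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    emissions: (time, vocab) tensor of per-frame logits or log-probabilities.
    id_to_token: maps vocabulary indices to symbols, e.g. {0: "<s>", 4: "|", 5: "a", ...}.
    """
    ids = emissions.argmax(dim=-1).tolist()
    if not ids:
        return ""

    # 1. Collapse consecutive duplicate predictions.
    collapsed = [ids[0]] + [cur for prev, cur in zip(ids, ids[1:]) if cur != prev]

    # 2. Remove the blank token, then map the word-boundary token "|" to a space.
    tokens = [id_to_token[i] for i in collapsed if id_to_token[i] != blank]
    return "".join(" " if t == "|" else t for t in tokens)
```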