Evaluating Wav2vec 2.0 with Transformer LM
What is your question?
How can I reproduce the WER improvement obtained by using the proposed Transformer LM instead of Viterbi decoding?
What have you tried?
Files used:
- Letter dictionary: from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md
- Wav2vec model: from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md
- Transformer LM: from https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019
- LM dict: from https://github.com/facebookresearch/wav2letter/tree/master/recipes/sota/2019, with upper-case processing (`dict.txt` placed in the same directory as `lm_librispeech_word_transformer.pt`)
```
head -3 dict.txt
THE 49059384
AND 26362574
OF 24795903
```
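The upper-case processing mentioned above can be sketched as follows; the file names and the helper `uppercase_dict` are assumptions for illustration, not part of the recipe:

```python
# Sketch: upper-case the words in the wav2letter LM dict so they match
# fairseq's upper-case letter vocabulary. File names are assumed.
def uppercase_dict(in_path: str, out_path: str) -> None:
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            word, count = line.split()
            # Keep the "<WORD> <count>" format, upper-casing only the word.
            fout.write(f"{word.upper()} {count}\n")
```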
Command used:
```shell
python examples/speech_recognition/infer.py /path/to/librispeech --task audio_pretraining --nbest 1 --path /path/to/wav2vec2_vox_960h.pt --gen-subset dev_clean --results-path outputdir --w2l-decoder fairseqlm --lm-model /path/to/lm_librispeech_word_transformer.pt --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000
```
This produces a WER above 50, while Viterbi decoding gives ~2 WER.
When using the lexicon file from https://github.com/pytorch/fairseq/issues/2734 by adding the argument `--lexicon /path/to/librispeech_lexicon.lst`, I get ~6 WER.
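For reference, a lexicon file in the wav2letter format maps each word to its space-separated letter spelling followed by the word-boundary token `|`. A minimal sketch of generating one from the LM dict (the helper name and file paths are assumptions):

```python
# Sketch: build a wav2letter-style lexicon from the LM dict.
# Each output line maps a word to its letters plus the word-boundary
# token "|", e.g. "THE\tT H E |". File names are assumed.
def build_lexicon(dict_path: str, lexicon_path: str) -> None:
    with open(dict_path) as fin, open(lexicon_path, "w") as fout:
        for line in fin:
            word = line.split()[0].upper()
            spelling = " ".join(word) + " |"
            fout.write(f"{word}\t{spelling}\n")
```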
What’s your environment?
- fairseq 0.10.0 (latest stable release)
- wav2letter branch v0.2 for the python bindings, plus the patch from https://github.com/facebookresearch/wav2letter/issues/775 (otherwise imports from w2l_decoder.py fail due to the missing LexiconFreeDecoder)
I don’t know what I did wrong. Thank you for your answer!
Issue Analytics
- State:
- Created: 3 years ago
- Reactions: 8
- Comments: 5
Top GitHub Comments
Hi, to recreate the results, I noticed that the LM weight, word insertion penalty, and beam size also play an important role. The authors used a variety of values depending on the fine-tuning data, the Transformer/KenLM, and the set being decoded. Please refer to the paper; the ablations section has the values they used for the different experiments. I followed this for the 1hr BASE fine-tuned model with both the 4-gram KenLM and the Transformer LM and got their results. With an LM weight of 2 and that word insertion penalty, I was getting around 10-15% higher WER.
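The tuning described above can be sketched as a simple grid sweep over the decoder flags from the `infer.py` command in the question. The value grids below are illustrative, not the paper's:

```python
# Sketch: enumerate infer.py commands over a grid of decoder
# hyper-parameters. Flag names come from the command in the question;
# the grids themselves are illustrative assumptions.
import itertools

LM_WEIGHTS = [0.5, 1.0, 1.5, 2.0]
WORD_SCORES = [-3.0, -1.0, 0.0, 1.0]

def sweep_commands(base_cmd: str):
    """Yield one full command line per (lm-weight, word-score) pair."""
    for lw, ws in itertools.product(LM_WEIGHTS, WORD_SCORES):
        yield f"{base_cmd} --lm-weight {lw} --word-score {ws}"
```

One would run each command on a dev set and keep the pair with the lowest WER before decoding the test sets.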
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!