Question on Wav2Vec2 replication using words instead of letters and how/why lexicon?
❓ Questions and Help
What is your question?
Firstly, just want to say thank you for making all of the Wav2Vec2 resources and help available, it has made it much easier to replicate your original paper. I have spent the better part of a month replicating everything and going through all the source code in fairseq to get a full understanding of how it all works in relation to the paper. Unfortunately, a few key things are left very much unclear and I was hoping someone could please help me out.
All of the examples and config files (even the pretrained models available for download) seem to be set up for letter-based labels in the audio_pretraining task. Is it recommended, or even feasible, to train on a word-based label approach instead? The dictionary would be enormous by contrast; for LibriSpeech, let's imagine a word vocab size of 50000. Pretraining and even finetuning should be straightforward enough, but I imagine inference (using examples/speech_recognition/infer.py) would be problematic: Viterbi could not be used (since the state space is now 50000 and the search is O(50000^2)), and while you could probably use the 4-gram KenLM arpa from LibriSpeech's website (since it's trained on words, not letters), I'm unsure what you would use as a lexicon file in the word case.
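For reference, by a word-based approach I mean something like the following rough sketch, where I build a word-level dictionary in the fairseq "token count" format from the .wrd transcript files (as produced alongside the .ltr files by the wav2vec example's label preparation script); the dict.wrd.txt name is just my own placeholder, by analogy with dict.ltr.txt:

# Rough sketch: build a word-level dictionary (fairseq "token count" format),
# analogous to dict.ltr.txt; the dict.wrd.txt name is just my assumption.
from collections import Counter

counts = Counter()
for path in ("train.wrd", "valid.wrd"):  # word-level transcripts, one utterance per line
    with open(path) as f:
        for line in f:
            counts.update(line.split())

with open("dict.wrd.txt", "w") as out:
    for word, count in counts.most_common():
        out.write(f"{word} {count}\n")

print("word vocab size:", len(counts))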
I also wanted to ask about KenLM, what exactly the lexicon file accomplishes, and what the correct way to generate it is. Unfortunately, I can find no documentation anywhere about the lexicon file and how/why it's necessary, or about what a unit KenLM model is (or how you create one) such that you don't need a lexicon file. The closest thing I have found on how to generate an appropriate lexicon file from a KenLM arpa is this script: https://github.com/facebookresearch/wav2letter/blob/master/recipes/utilities/prepare_librispeech_official_lm.py
Looking at the lexicon file that script generates, it is of the form:
EVERY E V E R Y |
WORD W O R D |
THAT T H A T |
EXISTS E X I S T S |
IN I N |
YOUR Y O U R |
LABEL L A B E L |
OR O R |
TRANSCRIPTION T R A N S C R I P T I O N |
FILE F I L E |
WILL W I L L |
WRITE W R I T E |
DOWN D O W N |
LIKE L I K E |
THIS T H I S |
So it appears to be a mapping from words to letters, such that decoding with letter-based labels in wav2vec2 can use the word n-grams from the LM while mapping them back to their letters? In that case, what should the lexicon file look like if you are using word-based labels instead of letter-based labels? And finally, what is a unit KenLM model, such that you can use the LexiconFreeDecoder, and how do you create such a unit model using the KenLM binaries and the original text corpus?
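To make the question concrete, here is a rough sketch of how I currently generate a lexicon of that form. It mirrors the output shown above rather than the official recipe, and the input/output file names (librispeech-vocab.txt, lexicon.txt) are just my placeholders:

# Rough sketch: build a word -> letters lexicon matching the format shown above.
# File names are placeholders; the official recipe is the wav2letter script linked earlier.
words = set()
with open("librispeech-vocab.txt") as f:  # assumed: one word per line
    for line in f:
        word = line.strip().upper()
        if word:
            words.add(word)

with open("lexicon.txt", "w") as out:
    for word in sorted(words):
        spelling = " ".join(word) + " |"  # e.g. "EVERY" -> "E V E R Y |"
        out.write(f"{word} {spelling}\n")

On the unit KenLM question, my current (unconfirmed) understanding is that a unit LM is simply a KenLM model trained on the CTC units themselves, i.e. on text rewritten as space-separated letters with | as the word boundary, so that the decoder tokens and the LM tokens coincide and no lexicon is needed. Is that right?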
Finally, how can one use the pretrained LibriSpeech Transformer LM available for download in the Wav2Letter repo to decode in infer.py instead of the KenLM model? I ask because that Transformer LM is a word-based model, and the corresponding dict file has 221k words in it. Does that mean that in order to use the Transformer LM you need to set up your wav2vec2 finetuning to use word labels instead of letter labels (since the transformer is a word-based model)? And I can see in the W2lFairseqLMDecoder code that the Transformer LM also takes a lexicon file. What does that lexicon file need to look like and how can I generate it? I am assuming it is different in both format and generation from the lexicon file used by KenLM models?
Would greatly appreciate any help.
What have you tried?
What’s your environment?
- fairseq Version (e.g., 1.0 or master): Master
- PyTorch Version (e.g., 1.0): 1.7.1
- OS (e.g., Linux): Linux Ubuntu 16.04
- How you installed fairseq (pip, source): Source
- Build command you used (if compiling from source): python setup.py bdist_wheel
- Python version: 3.6.12
- CUDA/cuDNN version: 10.2 / 7.6.5
- GPU models and configuration: 4x V100s
- Any other relevant information:
Top GitHub Comments
That is certainly possible; take a look here: https://hydra.cc/docs/next/tutorials/basic/running_your_app/working_directory/
For sweeps (i.e. running with the -m flag) you can set hydra.sweep.dir (and hydra.sweep.subdir), for example as in the sketch below.
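Something like this (the config name and paths are just placeholders):

fairseq-hydra-train -m \
  hydra.sweep.dir=/path/to/sweep_outputs \
  hydra.sweep.subdir='${hydra.job.num}' \
  --config-dir /path/to/configs \
  --config-name base_100h

Each job in the sweep then writes its outputs under the sweep directory, in its own subdirectory.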
Re: the hydra training log being empty: that is an issue with hydra for which I recently implemented a workaround. If you grab the latest code it should now log properly to that file.
wav2letter will use whatever you put in your lexicon file to look up words in the LM, so it is as you say: if you have a lowercase word LM then the first column in the lexicon file should be lowercase. And also as you say, the second column should use the units from the CTC model.
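For example, if your word LM is lowercase but your CTC model was fine-tuned on uppercase letters (as in the lexicon snippet shown earlier in this issue), the lexicon lines would look something like:

every E V E R Y |
word W O R D |
that T H A T |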