Conflict between pyctcdecode and Wav2Vec2ProcessorWithLM
System Info
- transformers: 4975002df50c472cbb6f8ac3580e475f570606ab
- pyctcdecode: 9afead58560df07c021aa01285cd941f70fe93d5
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Error:
The tokens {'', '⁇', ' '} are defined in the tokenizer's vocabulary, but not in the decoder's alphabet. Make sure to include {'', '⁇', ' '} in the decoder's alphabet.
Reason:
`get_missing_alphabet_tokens` replaces the tokenizer's special tokens (blank, unknown, and word delimiter) with their decoder-side equivalents. If we call `build_ctcdecoder` with the raw tokenizer vocabulary, however, the two alphabets will always mismatch.
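The mismatch can be illustrated with a small standard-library sketch. The normalization map below is an assumption modelled on what `get_missing_alphabet_tokens` effectively does, and the toy vocabulary is hypothetical:

```python
# Toy reproduction of the vocabulary check. The normalization map and the set
# comparison are assumptions sketching the processor's behaviour, not its real code.
raw_vocab = {"<pad>", "<s>", "</s>", "<unk>", "|", "a", "b"}

# The processor maps special tokens to their decoder-side forms before comparing:
# blank -> "", unknown -> "⁇", word delimiter -> " ".
normalized = {"<pad>": "", "<unk>": "⁇", "|": " "}
expected = {normalized.get(tok, tok) for tok in raw_vocab}

# A decoder built from the *raw* tokenizer vocab keeps the raw token strings,
# so the normalized forms are reported as missing.
decoder_alphabet = set(raw_vocab)
missing = expected - decoder_alphabet
print(missing)  # {'', '⁇', ' '} -- the exact set in the error message
```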
Expected behavior
A straightforward fix is to apply the same mapping before calling `build_ctcdecoder`:
```python
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder
from pyctcdecode.alphabet import BLANK_TOKEN_PTN, UNK_TOKEN, UNK_TOKEN_PTN

model_to_add_lm = "wav2vec2-large-xxxxx"
lm_arpa_path = "xxxxx.arpa"

processor = AutoProcessor.from_pretrained(model_to_add_lm)
vocab_dict = processor.tokenizer.get_vocab()

# Sort tokens by id so the label order matches the CTC output dimension.
sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
alphabet = list(sorted_vocab_dict.keys())

# Apply the same special-token mapping that get_missing_alphabet_tokens uses.
for i, token in enumerate(alphabet):
    if BLANK_TOKEN_PTN.match(token):
        alphabet[i] = ""  # CTC blank
    if token == processor.tokenizer.word_delimiter_token:
        alphabet[i] = " "  # word delimiter
    if UNK_TOKEN_PTN.match(token):
        alphabet[i] = UNK_TOKEN  # unknown token

decoder = build_ctcdecoder(
    labels=alphabet,
    kenlm_model_path=lm_arpa_path,
)
decoder._alphabet._labels = alphabet

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
processor_with_lm.save_pretrained("xxxxxx")
```
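To see the special-token mapping in action without downloading a model, here is a toy run of the same replacement logic. The regexes are stand-ins for pyctcdecode's `BLANK_TOKEN_PTN` and `UNK_TOKEN_PTN` (assumptions; the library's real patterns may differ), and the vocabulary is hypothetical:

```python
import re

# Stand-in patterns modeled on pyctcdecode's blank/unknown detection (assumptions).
BLANK_PTN = re.compile(r"^<pad>$", re.IGNORECASE)
UNK_PTN = re.compile(r"^<unk>$", re.IGNORECASE)
UNK_TOKEN = "⁇"
WORD_DELIM = "|"  # Wav2Vec2's default word_delimiter_token

# Hypothetical toy vocab with Wav2Vec2-style special tokens.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "a": 5, "b": 6}
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

for i, tok in enumerate(labels):
    if BLANK_PTN.match(tok):
        labels[i] = ""          # CTC blank
    elif tok == WORD_DELIM:
        labels[i] = " "         # word boundary
    elif UNK_PTN.match(tok):
        labels[i] = UNK_TOKEN   # unknown token

print(labels)  # ['', '<s>', '</s>', '⁇', ' ', 'a', 'b']
```

Note that `<s>` and `</s>` pass through unchanged, which is why only {'', '⁇', ' '} appear in the error message above.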
Issue Analytics
- State:
- Created a year ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
Hey @voidful, sorry about the delayed reply. I’ve taken a deeper look into your issue - it looks as though there is a mismatch between the tokeniser’s and the LM’s vocabularies (12305 tokens, to be exact): https://colab.research.google.com/drive/1v1qd4CUdSXKmrSYIMqMzMk_KCUMfMWu9?usp=sharing
For LM-boosted beam-search decoding with CTC, we need the vocabulary of the LM to match that of the tokeniser one-to-one. You can ensure this by training your LM using the same method that you use to train the Wav2Vec2 tokeniser. You then shouldn’t have to override `decoder._alphabet._labels`: the vocabularies should already match (barring the special tokens). See this example for creating a tokeniser: https://github.com/sanchit-gandhi/seq2seq-speech/blob/main/get_ctc_tokenizer.py
And this example for creating a corresponding LM: https://github.com/sanchit-gandhi/seq2seq-speech/blob/main/get_ctc_ngram.py
This blog also explains succinctly how one can train and instantiate an LM: https://huggingface.co/blog/wav2vec2-with-ngram
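A vocabulary mismatch like the one above can be quantified with a symmetric set difference between the two vocabularies. This is a hedged, standard-library sketch with toy sets, not the linked notebook's actual code:

```python
# Hypothetical sketch: count tokens that appear in exactly one of the two
# vocabularies (tokenizer vs. LM). The sets below are toy data.
tokenizer_vocab = {"a", "b", "c", "hello", "world"}
lm_vocab = {"a", "b", "hello", "there"}

mismatched = tokenizer_vocab ^ lm_vocab  # symmetric difference
print(len(mismatched))  # 3
print(sorted(mismatched))  # ['c', 'there', 'world']
```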
Maybe of interest to @patrickvonplaten @anton-l @sanchit-gandhi