Conflict between pyctcdecode and Wav2Vec2ProcessorWithLM
System Info
- transformers: 4975002df50c472cbb6f8ac3580e475f570606ab
- pyctcdecode: 9afead58560df07c021aa01285cd941f70fe93d5
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Error:
The tokens {'', '⁇', ' '} are defined in the tokenizer's vocabulary, but not in the decoder's alphabet. Make sure to include {'', '⁇', ' '} in the decoder's alphabet.
Reason:
`get_missing_alphabet_tokens` replaces the tokenizer's special tokens (blank, unknown, and word delimiter) with their decoder-side equivalents. If we call `build_ctcdecoder` with the raw tokenizer vocabulary, however, the two alphabets will always mismatch.
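The mismatch can be illustrated with a small standard-library sketch. The normalization map below is an assumption modelled on what `get_missing_alphabet_tokens` effectively does, and the toy vocabulary is hypothetical:

```python
# Toy reproduction of the vocabulary check. The normalization map and the set
# comparison are assumptions sketching the processor's behaviour, not its real code.
raw_vocab = {"<pad>", "<s>", "</s>", "<unk>", "|", "a", "b"}

# The processor maps special tokens to their decoder-side forms before comparing:
# blank -> "", unknown -> "⁇", word delimiter -> " ".
normalized = {"<pad>": "", "<unk>": "⁇", "|": " "}
expected = {normalized.get(tok, tok) for tok in raw_vocab}

# A decoder built from the *raw* tokenizer vocab keeps the raw token strings,
# so the normalized forms are reported as missing.
decoder_alphabet = set(raw_vocab)
missing = expected - decoder_alphabet
print(missing)  # {'', '⁇', ' '} -- the exact set in the error message
```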
Expected behavior
A straightforward fix is to apply the same mapping before calling `build_ctcdecoder`:
```python
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder
from pyctcdecode.alphabet import BLANK_TOKEN_PTN, UNK_TOKEN, UNK_TOKEN_PTN

model_to_add_lm = "wav2vec2-large-xxxxx"
lm_arpa_path = "xxxxx.arpa"

processor = AutoProcessor.from_pretrained(model_to_add_lm)
vocab_dict = processor.tokenizer.get_vocab()

# Sort tokens by id so the label order matches the CTC output dimension.
sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
alphabet = list(sorted_vocab_dict.keys())

# Apply the same special-token mapping that get_missing_alphabet_tokens uses.
for i, token in enumerate(alphabet):
    if BLANK_TOKEN_PTN.match(token):
        alphabet[i] = ""  # CTC blank
    if token == processor.tokenizer.word_delimiter_token:
        alphabet[i] = " "  # word delimiter
    if UNK_TOKEN_PTN.match(token):
        alphabet[i] = UNK_TOKEN  # unknown token

decoder = build_ctcdecoder(
    labels=alphabet,
    kenlm_model_path=lm_arpa_path,
)
decoder._alphabet._labels = alphabet

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
processor_with_lm.save_pretrained("xxxxxx")
```
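To see the special-token mapping in action without downloading a model, here is a toy run of the same replacement logic. The regexes are stand-ins for pyctcdecode's `BLANK_TOKEN_PTN` and `UNK_TOKEN_PTN` (assumptions; the library's real patterns may differ), and the vocabulary is hypothetical:

```python
import re

# Stand-in patterns modeled on pyctcdecode's blank/unknown detection (assumptions).
BLANK_PTN = re.compile(r"^<pad>$", re.IGNORECASE)
UNK_PTN = re.compile(r"^<unk>$", re.IGNORECASE)
UNK_TOKEN = "⁇"
WORD_DELIM = "|"  # Wav2Vec2's default word_delimiter_token

# Hypothetical toy vocab with Wav2Vec2-style special tokens.
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "a": 5, "b": 6}
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]

for i, tok in enumerate(labels):
    if BLANK_PTN.match(tok):
        labels[i] = ""          # CTC blank
    elif tok == WORD_DELIM:
        labels[i] = " "         # word boundary
    elif UNK_PTN.match(tok):
        labels[i] = UNK_TOKEN   # unknown token

print(labels)  # ['', '<s>', '</s>', '⁇', ' ', 'a', 'b']
```

Note that `<s>` and `</s>` pass through unchanged, which is why only {'', '⁇', ' '} appear in the error message above.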
Issue Analytics
- State:
- Created a year ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
Hey @voidful, sorry about the delayed reply. I’ve taken a deeper look into your issue - it looks as though there is a mismatch between the tokeniser’s and the LM’s vocabularies (12305 tokens, to be exact): https://colab.research.google.com/drive/1v1qd4CUdSXKmrSYIMqMzMk_KCUMfMWu9?usp=sharing
For LM-boosted beam-search decoding with CTC, we need the vocabulary of the LM to match that of the tokeniser one-to-one. You can ensure this by training your LM using the same method that you use to train the Wav2Vec2 tokeniser. You then shouldn’t have to override `decoder._alphabet._labels`: the vocabularies should already match (barring the special tokens). See this example for creating a tokeniser: https://github.com/sanchit-gandhi/seq2seq-speech/blob/main/get_ctc_tokenizer.py
And this example for creating a corresponding LM: https://github.com/sanchit-gandhi/seq2seq-speech/blob/main/get_ctc_ngram.py
This blog also explains succinctly how one can train and instantiate an LM: https://huggingface.co/blog/wav2vec2-with-ngram
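A vocabulary mismatch like the one above can be quantified with a symmetric set difference between the two vocabularies. This is a hedged, standard-library sketch with toy sets, not the linked notebook's actual code:

```python
# Hypothetical sketch: count tokens that appear in exactly one of the two
# vocabularies (tokenizer vs. LM). The sets below are toy data.
tokenizer_vocab = {"a", "b", "c", "hello", "world"}
lm_vocab = {"a", "b", "hello", "there"}

mismatched = tokenizer_vocab ^ lm_vocab  # symmetric difference
print(len(mismatched))  # 3
print(sorted(mismatched))  # ['c', 'there', 'world']
```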
Maybe of interest to @patrickvonplaten @anton-l @sanchit-gandhi