
Conflict between pyctcdecode and Wav2Vec2ProcessorWithLM


System Info

  • transformers @ 4975002df50c472cbb6f8ac3580e475f570606ab
  • pyctcdecode @ 9afead58560df07c021aa01285cd941f70fe93d5

Who can help?

@patrici

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Error: The tokens {'', '⁇', ' '} are defined in the tokenizer's vocabulary, but not in the decoder's alphabet. Make sure to include {'', '⁇', ' '} in the decoder's alphabet.

Reason: get_missing_alphabet_tokens replaces the tokenizer's special tokens (CTC blank, word delimiter, unknown) before comparing them against the decoder's alphabet:

https://github.com/huggingface/transformers/blob/4975002df50c472cbb6f8ac3580e475f570606ab/src/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py#L196

However, if we call build_ctcdecoder with the raw tokenizer vocabulary, the decoder's alphabet still contains the unmapped special tokens, so the comparison always fails.
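For context, here is a minimal paraphrase of that mapping; the helper name normalize_special_token is hypothetical, but the patterns and replacement values come from pyctcdecode's own alphabet module (see the linked source for the exact transformers-side logic):

from pyctcdecode.alphabet import BLANK_TOKEN_PTN, UNK_TOKEN, UNK_TOKEN_PTN

def normalize_special_token(token: str, word_delimiter_token: str = "|") -> str:
    """Sketch of the special-token mapping applied on the transformers side."""
    if BLANK_TOKEN_PTN.match(token):
        return ""  # the CTC blank, e.g. "<pad>", becomes the empty string
    if token == word_delimiter_token:
        return " "  # the word delimiter, usually "|", becomes a plain space
    if UNK_TOKEN_PTN.match(token):
        return UNK_TOKEN  # unknown tokens, e.g. "<unk>", become "⁇"
    return token

A decoder built from the raw vocabulary still carries "<pad>", "|" and "<unk>", while the check compares against "", " " and "⁇" — which is exactly the error above.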

Expected behavior

A straightforward fix is to apply the same special-token mapping to the labels before calling build_ctcdecoder:

from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM
from pyctcdecode import build_ctcdecoder
from pyctcdecode.alphabet import BLANK_TOKEN_PTN, UNK_TOKEN, UNK_TOKEN_PTN

model_to_add_lm = "wav2vec2-large-xxxxx"
lm_arpa_path = "xxxxx.arpa"

processor = AutoProcessor.from_pretrained(model_to_add_lm)

# Sort the vocabulary by token id so label positions line up with the logits.
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k: v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
alphabet = list(sorted_vocab_dict.keys())

# Apply the same special-token mapping that get_missing_alphabet_tokens uses:
# blank -> "", word delimiter -> " ", unknown -> pyctcdecode's UNK_TOKEN ("⁇").
for i, token in enumerate(alphabet):
    if BLANK_TOKEN_PTN.match(token):
        alphabet[i] = ""
    elif token == processor.tokenizer.word_delimiter_token:
        alphabet[i] = " "
    elif UNK_TOKEN_PTN.match(token):
        alphabet[i] = UNK_TOKEN

decoder = build_ctcdecoder(
    labels=alphabet,
    kenlm_model_path=lm_arpa_path,
)

# Force the mapped labels onto the decoder's alphabet so they match the
# tokenizer's vocabulary exactly.
decoder._alphabet._labels = alphabet

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)

processor_with_lm.save_pretrained("xxxxxx")
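For reference, a minimal sketch of loading the saved processor back and decoding with it; the path is the placeholder from above, and the logits are random stand-ins for a real Wav2Vec2 forward pass:

import numpy as np
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("xxxxxx")

# Real logits would come from the model: shape (time_steps, vocab_size).
logits = np.random.rand(100, len(processor_with_lm.tokenizer.get_vocab())).astype(np.float32)
transcription = processor_with_lm.decode(logits).text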

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
sanchit-gandhi commented, Oct 10, 2022

Hey @voidful, sorry about the delayed reply. I’ve taken a deeper look into your issue: it looks as though there is a mismatch between the tokeniser’s and the LM’s vocabularies (12305 tokens, to be exact): https://colab.research.google.com/drive/1v1qd4CUdSXKmrSYIMqMzMk_KCUMfMWu9?usp=sharing

For LM-boosted beam-search decoding for CTC, we need the vocabulary of the LM to match that of the tokeniser one-to-one. You can ensure this by training your LM using the same method that you use to train the Wav2Vec2 tokeniser. You then shouldn’t have to override decoder._alphabet._labels: the vocabularies should already match (barring the special tokens).

See this example for creating a tokeniser: https://github.com/sanchit-gandhi/seq2seq-speech/blob/main/get_ctc_tokenizer.py

And this example for creating a corresponding LM: https://github.com/sanchit-gandhi/seq2seq-speech/blob/main/get_ctc_ngram.py

This blog also explains succinctly how one can train and instantiate an LM: https://huggingface.co/blog/wav2vec2-with-ngram
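As a quick sanity check of that one-to-one match, one could compare the two vocabularies directly; a minimal sketch, assuming the alphabet and decoder objects from the reproduction script above (and relying on the decoder's private _alphabet attribute, as that script already does):

# Mapped tokenizer labels versus the labels the decoder actually uses.
tokenizer_tokens = set(alphabet)
decoder_tokens = set(decoder._alphabet.labels)

print("in tokenizer but not in decoder:", tokenizer_tokens - decoder_tokens)
print("in decoder but not in tokenizer:", decoder_tokens - tokenizer_tokens)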

1 reaction
LysandreJik commented, Jul 21, 2022

