Beam search decoding and language model integration for Wav2Vec2ForCTC models

AFAIK, the Wav2Vec2CTCTokenizer.decode method only provides greedy decoding. Is there a beam search implementation for CTC available yet?
Also, in ASR modelling a language model is commonly added on top of the acoustic model. It would be nice to be able to plug in a pretrained language model that is taken into account at beam search decoding time. Is there an out-of-the-box solution for that yet?
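For context, greedy CTC decoding can be sketched in a few lines. This is a minimal, library-independent illustration and not the tokenizer's actual implementation: the function name `ctc_greedy_decode`, the toy vocabulary, and the assumption that the blank token has id 0 are all made up for the example.

```python
from itertools import groupby


def ctc_greedy_decode(probs, vocab, blank_id=0):
    """Greedy (best-path) CTC decoding.

    probs: list of frames, each a list of per-token probabilities.
    vocab: list mapping token id -> character (hypothetical toy vocabulary).
    blank_id: id of the CTC blank token (assumed to be 0 here).
    """
    # 1. Pick the most likely token independently at every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    # 2. Collapse consecutive repeats of the same token.
    collapsed = [token_id for token_id, _ in groupby(best_path)]
    # 3. Drop blanks.
    return "".join(vocab[i] for i in collapsed if i != blank_id)


toy_vocab = ["_", "a", "b"]  # "_" stands for the CTC blank
toy_probs = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, collapsed)
    [0.90, 0.05, 0.05],  # blank (separates the two a's)
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.10, 0.80],  # b
]
print(ctc_greedy_decode(toy_probs, toy_vocab))  # prints "aab"
```

Because each frame is decided independently, greedy decoding scores a single alignment and ignores the fact that many alignments can map to the same transcript.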

I’m also aware of the effort to integrate a language model in #10794 and have had a look at the notebook here. Although it is a nice, simple way to integrate an LM, it is suboptimal with respect to CTC semantics. A more appropriate approach would be the one described in this paper and explained in this Distill blog post. It would be great to have these features added (if they are not already there and I somehow missed them).
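To make the request concrete, here is a rough sketch of CTC prefix beam search with an optional language model hook. It is a simplified illustration under stated assumptions, not the algorithm exactly as given in the referenced paper or in any library: it works with raw probabilities rather than log probabilities, the blank id is assumed to be 0, and `lm_score` is a hypothetical callable that returns a probability-like score for a text prefix and is applied on every character extension (weighted by `alpha`) rather than only at word boundaries.

```python
from collections import defaultdict


def ctc_prefix_beam_search(probs, vocab, beam_width=8, blank_id=0,
                           lm_score=None, alpha=0.5):
    """Simplified CTC prefix beam search over per-frame probabilities."""

    def lm_weight(prefix):
        # Hypothetical LM hook: no LM means a neutral weight of 1.0.
        if lm_score is None:
            return 1.0
        return lm_score("".join(vocab[i] for i in prefix)) ** alpha

    # Each prefix keeps two scores: paths ending in blank / in non-blank.
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            p_total = p_blank + p_non_blank
            for token_id, p in enumerate(frame):
                if token_id == blank_id:
                    # A blank keeps the prefix unchanged.
                    next_beams[prefix][0] += p_total * p
                elif prefix and token_id == prefix[-1]:
                    # Repeat directly after the same token: still the same prefix.
                    next_beams[prefix][1] += p_non_blank * p
                    # Repeat after a blank: a genuinely new character.
                    extended = prefix + (token_id,)
                    next_beams[extended][1] += p_blank * p * lm_weight(extended)
                else:
                    extended = prefix + (token_id,)
                    next_beams[extended][1] += p_total * p * lm_weight(extended)
        # Keep only the beam_width most probable prefixes.
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}

    best_prefix = max(beams.items(), key=lambda kv: sum(kv[1]))[0]
    return "".join(vocab[i] for i in best_prefix)


# Two frames where blank is the single most likely token at each step,
# yet the alignments that collapse to "a" jointly outweigh the empty output.
beam_vocab = ["_", "a"]
beam_probs = [[0.6, 0.4], [0.6, 0.4]]
print(ctc_prefix_beam_search(beam_probs, beam_vocab))  # prints "a"
```

On this input a greedy decoder would pick blank at both frames and return an empty string, whereas summing over alignments gives "a" a total probability of 0.64 against 0.36 for the empty transcript. A production implementation would normally work in log space and score words rather than characters with the LM.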

Issue Analytics

State:
Created 2 years ago
Reactions: 3
Comments: 13 (5 by maintainers)

Top GitHub Comments

3 reactions

patrickvonplaten commented, May 3, 2021

I think we can try to add a dependency on wav2letter: https://github.com/flashlight/wav2letter and add LM decoding as explained here on fairseq: https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md#evaluating-a-ctc-model . It would be awesome if we managed to create a nice run_wav2vec2_eval_with_lm.py script that people can use out of the box with every wav2vec2 model. We could also turn this into a nice blog post and publish it on our blog 😃

3 reactions

deepang17 commented, Apr 28, 2021

Hello @patrickvonplaten and @tanujjain,

I have already worked on prefix beam search decoding with language models for wav2vec2 and would like to implement it for Hugging Face, if you are okay with it.