Transcription being concatenated oddly

I am trying to use the CTC decoding feature with KenLM on the logits from Hugging Face's wav2vec2 model:

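from pyctcdecode import Alphabet, BeamSearchDecoderCTC, LanguageModel

# kenlm_model is a KenLM model loaded elsewhere (not shown in the original post)
vocab = ['l', 'z', 'u', 'k', 'f', 'r', 'g', 'i', 'v', 's', 'o', 'b', 'w', 'e', 'd', 'n', 'y', 'c', 'q', 'p', 'h', 't', 'a', 'x', ' ', 'j', 'm', '⁇', '', '⁇', '⁇']
# ctc_token_idx=-3 selects the empty-string entry as the CTC blank token
alphabet = Alphabet.build_alphabet(vocab, ctc_token_idx=-3)
# Language model
lm = LanguageModel(kenlm_model, alpha=0.169, beta=0.055)
# Build the decoder used to decode the logits
decoder = BeamSearchDecoderCTC(alphabet, lm)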
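
Negative indices are easy to get wrong, so a quick sanity check of the vocabulary can help rule out a mis-specified blank token before looking at the decoder itself. The sketch below uses plain Python only (no pyctcdecode calls) and assumes that the empty-string entry is the intended CTC blank, which the snippet above implies but does not state. The variable names are illustrative.

```python
# Vocabulary copied from the snippet above (the '⁇' entries are kept as-is).
vocab = ['l', 'z', 'u', 'k', 'f', 'r', 'g', 'i', 'v', 's', 'o', 'b', 'w', 'e',
         'd', 'n', 'y', 'c', 'q', 'p', 'h', 't', 'a', 'x', ' ', 'j', 'm',
         '⁇', '', '⁇', '⁇']
ctc_token_idx = -3

# Resolve the negative index to an absolute position in the list.
blank_position = ctc_token_idx % len(vocab)
blank_token = vocab[ctc_token_idx]
print(f"blank token {blank_token!r} at position {blank_position} of {len(vocab)}")

# Words can only be separated if the vocabulary contains a space entry.
space_count = vocab.count(' ')
print(f"space entries: {space_count}")

# Repeated labels are worth knowing about, since several indices
# then map to the same symbol.
duplicates = sorted({tok for tok in vocab if vocab.count(tok) > 1})
print(f"duplicate entries: {duplicates}")
```

If the blank index and the space entry look as expected, the concatenation is more likely to come from how candidate words are scored than from the alphabet.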

The decoder returns the following output with a beam width of 64:

yeah jon okay i m calling from the clinic the family doctor clinessegryand this number six four five five one three o five

whereas when I previously decoded with https://github.com/ynop/py-ctc-decode, using the same LM and parameters, I got:

yeah on okay i am calling from the clinic the family dot clinic try and this number six four five five one three o five

I don’t understand why the words are being concatenated together. Do you have any thoughts?
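
To pin down exactly where the two outputs diverge, a word-level diff can help. This is a minimal sketch using Python's standard difflib module on the two transcripts quoted above; it is only a way to inspect the outputs and says nothing about how either decoder works internally.

```python
import difflib

# The two transcripts quoted above, copied verbatim.
new_output = ("yeah jon okay i m calling from the clinic the family doctor "
              "clinessegryand this number six four five five one three o five")
old_output = ("yeah on okay i am calling from the clinic the family dot "
              "clinic try and this number six four five five one three o five")

new_words = new_output.split()
old_words = old_output.split()

# Align the two word sequences and keep only the spans that differ.
matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
differences = [
    (old_words[i1:i2], new_words[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

for old_span, new_span in differences:
    print(f"{' '.join(old_span)!r} -> {' '.join(new_span)!r}")

print(f"word counts: old={len(old_words)}, new={len(new_words)}")
```

A lower word count in the new output, together with a long token that matches no single word in the old output, points to missing word boundaries rather than to differences in the recognized characters.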

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
gkucsko commented, Jun 18, 2021

See PR #4 for some additional warnings around unigrams, as well as improved partial scoring without a trie, which should help with word concatenation.

0 reactions
gkucsko commented, Jun 18, 2021

Sounds great, will have a look. Thanks for the feedback, and let me know if other issues come up!
