Transcription being concatenated oddly
I am trying to use the CTC decoding feature with KenLM on the logits from Hugging Face's wav2vec2 model:
vocab = ['l', 'z', 'u', 'k', 'f', 'r', 'g', 'i', 'v', 's', 'o', 'b', 'w', 'e', 'd', 'n', 'y', 'c', 'q', 'p', 'h', 't', 'a', 'x', ' ', 'j', 'm', '⁇', '', '⁇', '⁇']
alphabet = Alphabet.build_alphabet(vocab, ctc_token_idx=-3)
# Language model
lm = LanguageModel(kenlm_model, alpha=0.169, beta=0.055)
# Build the decoder used to decode the logits
decoder = BeamSearchDecoderCTC(alphabet, lm)
Decoding with a beam size of 64 returns the following output:
yeah jon okay i m calling from the clinic the family doctor clinessegryand this number six four five five one three o five
Previously, when I decoded with https://github.com/ynop/py-ctc-decode using the same LM and parameters, I got:
yeah on okay i am calling from the clinic the family dot clinic try and this number six four five five one three o five
I don’t understand why the words are being concatenated together. Do you have any thoughts?
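One way to narrow this down is to compare against a plain greedy (best-path) CTC decode of the same logits, with no language model involved. If the space between "clinic" and "try" is present there, the merge is more likely introduced during beam search and LM scoring than by the acoustic model. The sketch below is only an illustration: it assumes the vocabulary from the question with the blank at index -3, and `one_hot` is a hypothetical helper that fakes logits for a toy input.

```python
from itertools import groupby

# Vocabulary copied from the question; the empty string at index -3 is the CTC blank.
VOCAB = ['l', 'z', 'u', 'k', 'f', 'r', 'g', 'i', 'v', 's', 'o', 'b', 'w', 'e',
         'd', 'n', 'y', 'c', 'q', 'p', 'h', 't', 'a', 'x', ' ', 'j', 'm',
         '⁇', '', '⁇', '⁇']
BLANK_IDX = len(VOCAB) - 3


def greedy_ctc_decode(frame_scores, vocab=VOCAB, blank_idx=BLANK_IDX):
    """Best-path decode: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    collapsed = [idx for idx, _ in groupby(best)]
    return "".join(vocab[i] for i in collapsed if i != blank_idx)


def one_hot(idx, size=len(VOCAB)):
    # Hypothetical helper: builds one fake frame of scores for the demo.
    row = [0.0] * size
    row[idx] = 1.0
    return row


# Toy frame sequence: repeated symbols and a blank, as a CTC model would emit.
symbols = ['t', 't', 'r', '', 'y', ' ', ' ', 'a', 'n', 'n', 'd']
frames = [one_hot(VOCAB.index(s)) for s in symbols]
print(greedy_ctc_decode(frames))
```

With real model output, `frame_scores` would be the (time, vocab) logits matrix for one utterance.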
Issue Analytics
- Created 2 years ago
- Comments:10 (7 by maintainers)
Top GitHub Comments
See PR #4 for some additional warnings around unigrams, as well as improved partial scoring without a trie, which should help with word concatenation.
Sounds great, will have a look. Thanks for the feedback, and let me know if other issues come up!
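The unigram point raised in the comments can be illustrated with a small check. One plausible reading is that, without a list of known words, a decoder has no cheap way to tell that a merged token such as "clinessegryand" is out of vocabulary, or that a partial word can no longer grow into any known word. The sketch below is not the library's implementation: the `UNIGRAMS` set is a hypothetical toy word list, and `is_known_prefix` uses a linear scan as a stand-in for a trie lookup.

```python
# Hypothetical toy unigram list; a real one would come from the LM's vocabulary.
UNIGRAMS = {
    "yeah", "on", "okay", "i", "am", "calling", "from", "the", "clinic",
    "family", "doctor", "dot", "try", "and", "this", "number", "six",
    "four", "five", "one", "three", "o",
}


def find_oov_words(transcript, unigrams=UNIGRAMS):
    """Return decoded words that are not in the known-word list."""
    return [word for word in transcript.split() if word not in unigrams]


def is_known_prefix(partial, unigrams=UNIGRAMS):
    """True if some known word starts with `partial` (linear scan, no trie)."""
    return any(word.startswith(partial) for word in unigrams)


decoded = ("yeah jon okay i m calling from the clinic the family doctor "
           "clinessegryand this number six four five five one three o five")

# Words a unigram-aware decoder could flag or penalize.
print(find_oov_words(decoded))

# "clin" can still become "clinic"; "clines" cannot become any known word.
print(is_known_prefix("clin"), is_known_prefix("clines"))
```

Running a check like this on decoder output is a quick way to spot merged words before tuning the LM weights.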