Weird transcriptions from a fine-tuned model
Hi,
I fine-tuned the pre-trained French ASR model "stt_fr_conformer_ctc_large" on my own data, and I replaced the tokenizer with my own via model.change_vocabulary(new_tokenizer_dir=my_own_tokenizer, new_tokenizer_type="bpe").
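For reference, this is roughly how I loaded the model and swapped the tokenizer (the tokenizer directory name below is a placeholder for my own files):

```python
import nemo.collections.asr as nemo_asr

# Load the pre-trained French Conformer CTC (BPE) model.
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
    model_name="stt_fr_conformer_ctc_large"
)

# Swap in my own BPE tokenizer, built beforehand on my training text.
# "my_own_tokenizer" is a placeholder for the directory holding my tokenizer files.
asr_model.change_vocabulary(
    new_tokenizer_dir="my_own_tokenizer",
    new_tokenizer_type="bpe",
)
```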
The training went smoothly and I got 17% valid WER, which is rather good for our dataset. But when I ran transcription = asr_model.transcribe(["example.wav"]) for inference, I got weird transcriptions like the one below:
It seems that something changed the alphabet, and here is another example:
I looked back at the training log and found the same thing:
How can I get good transcriptions, please? And is my 17% WER trustworthy given this output? The exact inference call I used is sketched below. Thanks in advance.
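Roughly what I ran for inference (the .nemo checkpoint path is a placeholder for my fine-tuned model):

```python
import nemo.collections.asr as nemo_asr

# Restore the fine-tuned checkpoint (placeholder path) and transcribe one file.
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from("finetuned_model.nemo")
transcription = asr_model.transcribe(["example.wav"])
print(transcription)
```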
That would be incorrect. Your tokenizer needs to be updated first, since it is what is used to tokenize your dataset text into tokens. If you swap it in after dataset preparation, the datasets were built with the old tokenizer, so your new tokenizer is effectively incorrect and unused.
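For what it's worth, here is a minimal sketch of that ordering in NeMo; the manifest path, batch size, and tokenizer directory are placeholders, not values from this issue:

```python
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
    model_name="stt_fr_conformer_ctc_large"
)

# Step 1: update the tokenizer FIRST.
asr_model.change_vocabulary(
    new_tokenizer_dir="my_own_tokenizer",  # placeholder
    new_tokenizer_type="bpe",
)

# Step 2: only then attach the training data, so the dataloaders
# tokenize the manifest text with the new tokenizer.
train_config = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # placeholder
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_training_data(train_data_config=train_config)
```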
Oh, sorry for the misleading message. Yes, as you said, I updated the tokenizer first and then prepared the datasets, and it worked. Thanks.