
Weird transcriptions from a fine-tuned model

See original GitHub issue

Hi,

I fine-tuned the pre-trained French ASR model "stt_fr_conformer_ctc_large" on my own data, and I replaced the tokenizer with my own via model.change_vocabulary(new_tokenizer_dir=my_own_tokenizer, new_tokenizer_type="bpe").

Training went smoothly and I got 17% validation WER, which is rather good for our dataset. But when I ran transcription = asr_model.transcribe(["example.wav"]) for inference, I got weird transcriptions (see first screenshot). It seems that something changed the alphabet (second screenshot). Looking back into the training log, I found the same thing there (third screenshot).

How can I get good transcriptions? And is my 17% WER correct? Thanks in advance.
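For intuition on how a tokenizer/model mismatch produces "alphabet changed" output, here is a minimal, self-contained sketch (plain Python, no NeMo; the vocabularies and ids below are made up for illustration): the model emits token ids that are only meaningful with respect to the vocabulary it was trained against, so decoding the same ids with a different vocabulary yields gibberish.

```python
# Illustration of a tokenizer/model mismatch (hypothetical vocabularies).
# The acoustic model emits ids; the text you see depends entirely on which
# vocabulary those ids are decoded with.

train_vocab = ["_", "b", "o", "n", "j", "u", "r"]   # vocab the model was trained with
stale_vocab = ["_", "x", "q", "k", "z", "w", "v"]   # a different, mismatched vocab

ids = [1, 2, 3, 4, 2, 5, 6]   # ids the model might emit for "bonjour"

decoded_ok = "".join(train_vocab[i] for i in ids)
decoded_bad = "".join(stale_vocab[i] for i in ids)

print(decoded_ok)    # "bonjour" -- ids decoded with the matching vocabulary
print(decoded_bad)   # "xqkzqwv" -- same ids, wrong vocabulary: garbled output
```

The same effect appears during training if the dataset targets were tokenized with one vocabulary while the model decodes with another, which is what the resolution below turns out to be about.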

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8

Top GitHub Comments

1 reaction
titu1994 commented, Jan 10, 2022

That would be incorrect. Your tokenizer needs to be updated first - it is what’s used to tokenize your dataset text into tokens. If you put it after dataset preparation then your tokenizer is incorrect and unused.

0 reactions
BenoitWang commented, Jan 10, 2022

Oh, sorry for the confusing message. Yes, as you said, I updated the tokenizer first and then prepared the datasets, and it worked. Thanks!

Read more comments on GitHub >

Top Results From Across the Web

Wav2Vec2 WER remains 1.00 and return blank transcriptions
Hello everyone. I faced a strange problem and not sure about how i can resolve the problem myself. I was fine tuning wav2vec2...
Read more >
How to Fine-tune a GPT-3 Model - Step by Step - YouTube
In this video, we're going to go over how to fine-tune a GPT-3 model. We'll start by creating a prompt, then we'll generate...
Read more >
Creating fine-tuned GPT-3 models via the OpenAI ... - YouTube
Join the Bugout Slack dev community to connect with fellow data scientists, ML practitioners, and engineers: ...
Read more >
Hallucination of speech recognition errors with sequence to ...
We use this recognizer to obtain 1-best transcriptions for the 1.8 million odd utterances in the Fisher corpus at a roughly 30% word...
Read more >
Fine-tuned universe - Wikipedia
The characterization of the universe as finely tuned suggests that the occurrence of life in the universe is very sensitive to the values...
Read more >
