
Weird transcriptions from a fine-tuned model

See original GitHub issue

Hi,

I fine-tuned the pre-trained French ASR model "stt_fr_conformer_ctc_large" on my own data, and I replaced the tokenizer with my own via model.change_vocabulary(new_tokenizer_dir=my_own_tokenizer, new_tokenizer_type="bpe").

Training went smoothly and I got 17% validation WER, which is rather good for our dataset. But when I ran transcription = asr_model.transcribe(["example.wav"]) for inference, I got weird transcriptions (see first screenshot). It seems that something changed the alphabet (second screenshot). Looking back into the training log, I found the same thing there (third screenshot).

How can I get good transcriptions? And is my 17% WER correct? Thanks in advance.
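For intuition on how a tokenizer/model mismatch produces "alphabet changed" output, here is a minimal, self-contained sketch (plain Python, no NeMo; the vocabularies and ids below are made up for illustration): the model emits token ids that are only meaningful with respect to the vocabulary it was trained against, so decoding the same ids with a different vocabulary yields gibberish.

```python
# Illustration of a tokenizer/model mismatch (hypothetical vocabularies).
# The acoustic model emits ids; the text you see depends entirely on which
# vocabulary those ids are decoded with.

train_vocab = ["_", "b", "o", "n", "j", "u", "r"]   # vocab the model was trained with
stale_vocab = ["_", "x", "q", "k", "z", "w", "v"]   # a different, mismatched vocab

ids = [1, 2, 3, 4, 2, 5, 6]   # ids the model might emit for "bonjour"

decoded_ok = "".join(train_vocab[i] for i in ids)
decoded_bad = "".join(stale_vocab[i] for i in ids)

print(decoded_ok)    # "bonjour" -- ids decoded with the matching vocabulary
print(decoded_bad)   # "xqkzqwv" -- same ids, wrong vocabulary: garbled output
```

The same effect appears during training if the dataset targets were tokenized with one vocabulary while the model decodes with another, which is what the resolution below turns out to be about.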

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8

Top GitHub Comments

1 reaction
titu1994 commented, Jan 10, 2022

That would be incorrect. Your tokenizer needs to be updated first - it is what’s used to tokenize your dataset text into tokens. If you put it after dataset preparation then your tokenizer is incorrect and unused.

0 reactions
BenoitWang commented, Jan 10, 2022

Oh, sorry for the confusing message. Yes, as you said, I updated the tokenizer first and then prepared the datasets, and it worked. Thanks!

Read more comments on GitHub >

Top Results From Across the Web

Wav2Vec2 WER remains 1.00 and return blank transcriptions
Hello everyone. I faced a strange problem and not sure about how i can resolve the problem myself. I was fine tuning wav2vec2...
Read more >
How to Fine-tune a GPT-3 Model - Step by Step - YouTube
In this video, we're going to go over how to fine-tune a GPT-3 model. We'll start by creating a prompt, then we'll generate...
Read more >
Creating fine-tuned GPT-3 models via the OpenAI ... - YouTube
Join the Bugout Slack dev community to connect with fellow data scientists, ML practitioners, and engineers: ...
Read more >
Hallucination of speech recognition errors with sequence to ...
We use this recognizer to obtain 1-best transcriptions for the 1.8 million odd utterances in the Fisher corpus at a roughly 30% word...
Read more >
Fine-tuned universe - Wikipedia
The characterization of the universe as finely tuned suggests that the occurrence of life in the universe is very sensitive to the values...
Read more >
