
Evaluation of wav2vec2 model: every label string returns "<unk>"

See original GitHub issue

System Info

  • transformers version: 4.22.0.dev0

  • Platform: Linux-5.15.0-48-generic-x86_64-with-glibc2.10

  • Python version: 3.8.8

  • Huggingface_hub version: 0.8.1

  • PyTorch version (GPU?): 1.12.1+cu116 (True)

  • Tensorflow version (GPU?): not installed (NA)

  • Flax version (CPU?/GPU?/TPU?): not installed (NA)

  • Jax version: not installed

  • JaxLib version: not installed

  • Using GPU in script?: Yes

  • Using distributed or parallel set-up in script?: Both set-ups show the same issue

  • Datasets version ($ pip freeze | grep datasets): datasets==2.4.0

Who can help?

@patrickvonplaten @anton-l @sanchit-gandhi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the issue:

  1. Download the issue_report folder to your local machine.
  2. Open a command prompt and cd into the issue_report folder.
  3. Run the evaluation command: python ctc_finetune.py --eval
  4. The eval WER comes out at 1.0086 because label_str is always "<unk>", as printed at line 566 of ctc_finetune.py.
  5. To regenerate the dataset cache files, run: python customise_dataset.py

Here is the log printed at the end of evaluation (see full_log.log for more details):

```
***** Running Evaluation *****
  Num examples = 91
  Batch size = 4
100%|███████████████████████████████████████████| 23/23 [00:03<00:00, 5.54it/s]
pred_str[0]: THERE WERE BARRELS OF WINE IN THE SHU CELLOR
label_str[0]: <unk><unk><unk><unk><unk> <unk><unk><unk><unk> <unk><unk><unk><unk><unk><unk><unk> <unk><unk> <unk><unk><unk><unk> <unk><unk> <unk><unk><unk> <unk><unk><unk><unk> <unk><unk><unk><unk><unk><unk>
100%|███████████████████████████████████████████| 23/23 [00:03<00:00, 6.12it/s]
***** eval metrics *****
  eval_loss               = 4704.6416
  eval_runtime            = 0:00:06.64
  eval_samples            = 91
  eval_samples_per_second = 13.697
  eval_steps_per_second   = 3.462
  eval_wer                = 1.0086
```
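For context, this is roughly how pred_str and label_str are produced in the official run_speech_recognition_ctc.py example that scripts like ctc_finetune.py are based on. This is a sketch, not the exact code at line 566; the model checkpoint and the evaluate-based WER metric are assumptions:

```python
import evaluate
import numpy as np
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained(
    "facebook/wav2vec2-large-robust-ft-libri-960h"
)
wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    # Greedy CTC decoding: argmax over the vocabulary at every frame
    pred_ids = np.argmax(pred.predictions, axis=-1)

    # -100 marks padded label positions; map them back to the pad token id
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # Labels are plain character ids, so don't CTC-collapse repeated tokens
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    # If the transcripts were encoded against an upper-case-only vocab while
    # still lower case, every label id is the unk id and label_str is all <unk>
    return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}
```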

Expected behavior

I use the original pre-trained model facebook/wav2vec2-large-robust-ft-libri-960h for evaluation; the only change is my customized dataset. I could not figure out what is wrong with my own modified scripts, which have only minor changes from the official example scripts, so I am not sure whether the issue is in my scripts or in the fine-tuning libraries. Thanks in advance for helping me with this matter.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
lgq-liao commented, Sep 27, 2022

> Ok I think that's the issue. Your vocabulary likely only contains upper case letters. The tokenizer doesn't recognise lower case letters so it uses <unk> instead.
>
> Try converting your transcription column to upper case and see if that fixes it.

Yeah, that is the root cause. After I changed it to upper case, the issue goes away. Thank you so much for the troubleshooting.
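For illustration, a minimal sketch of that upper-casing step, assuming the transcripts live in a Hugging Face datasets column; the cache path and the column name "transcription" are placeholders, not taken from the original scripts:

```python
from datasets import load_from_disk

# Load the cached dataset (placeholder path)
dataset = load_from_disk("issue_report/dataset_cache")

# Upper-case every transcript so it matches the model's upper-case-only vocab;
# this must run before the labels are tokenized and cached
dataset = dataset.map(lambda ex: {"transcription": ex["transcription"].upper()})
```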

```
***** Running Evaluation *****
  Num examples = 91
  Batch size = 4
100%|███████████████████████████████████████████| 23/23 [00:03<00:00, 6.42it/s]
pred_str[0]: THERE WERE BARRELS OF WINE IN THE SHU CELLOR
label_str[0]: THERE WERE BARRELS OF WINE IN THE HUGE CELLAR
100%|███████████████████████████████████████████| 23/23 [00:03<00:00, 6.68it/s]
***** eval metrics *****
  eval_loss               = 118.7373
  eval_runtime            = 0:00:05.39
  eval_samples            = 91
  eval_samples_per_second = 16.856
  eval_steps_per_second   = 4.26
  eval_wer                = 0.1228
```

0 reactions
OllieBroadhurst commented, Sep 26, 2022

Ok I think that’s the issue. Your vocabulary likely only contains upper case letters. The tokenizer doesn’t recognise lower case letters so it uses <unk> instead.

Try converting your transcription column to upper case and see if that fixes it.
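One quick way to confirm this diagnosis (a sketch added here for illustration, not part of the original thread) is to inspect the tokenizer's vocabulary and round-trip a lower-case string:

```python
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained(
    "facebook/wav2vec2-large-robust-ft-libri-960h"
)
tokenizer = processor.tokenizer

# The CTC vocabulary contains only upper-case letters plus a few special tokens
print(sorted(tokenizer.get_vocab()))

# Lower-case characters are out of vocabulary, so a lower-case transcript
# encodes to nothing but unk ids and decodes to a row of <unk> tokens
ids = tokenizer("there were barrels of wine").input_ids
print(tokenizer.decode(ids, group_tokens=False))  # <unk><unk><unk>...

# The same transcript in upper case round-trips cleanly
ids = tokenizer("THERE WERE BARRELS OF WINE").input_ids
print(tokenizer.decode(ids, group_tokens=False))  # THERE WERE BARRELS OF WINE
```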
