
ASR pipeline does not work with openai/whisper on current master

See original GitHub issue

System Info

transformers @ git+https://github.com/huggingface/transformers.git@b651efe59ea506d38173e3a60a4228e7e74719f9
python 3.6
Standard AWS Ubuntu Deep Learning AMI (Ubuntu 18.04) Version 30.0

Who can help?

@Narsil @anton-l @sanchit-gandhi @patrickvonplaten

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

To reproduce, run the following code, based on the ASR pipeline example and Whisper:

from datasets import load_dataset
from transformers import pipeline

# Build an ASR pipeline from the Whisper checkpoint and transcribe one
# sample of the dummy LibriSpeech split, with long-form chunking enabled.
pipe = pipeline(model="openai/whisper-large")
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
output = pipe(ds[0]['file'], chunk_length_s=30, stride_length_s=(4, 2))

This yields:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-efceed64cd5c> in <module>
----> 1 output = pipe(ds[0]['file'], chunk_length_s=30, stride_length_s=(4, 2))

~/venv38/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py in __call__(self, inputs, **kwargs)
    181                         `"".join(chunk["text"] for chunk in output["chunks"])`.
    182         """
--> 183         return super().__call__(inputs, **kwargs)
    184 
    185     def _sanitize_parameters(self, **kwargs):

~/venv38/lib/python3.8/site-packages/transformers/pipelines/base.py in __call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1072             return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
   1073         else:
-> 1074             return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
   1075 
   1076     def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):

~/venv38/lib/python3.8/site-packages/transformers/pipelines/base.py in run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1093     def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
   1094         all_outputs = []
-> 1095         for model_inputs in self.preprocess(inputs, **preprocess_params):
   1096             model_outputs = self.forward(model_inputs, **forward_params)
   1097             all_outputs.append(model_outputs)

~/venv38/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py in preprocess(self, inputs, chunk_length_s, stride_length_s)
    260             # Currently chunking is not possible at this level for `seq2seq` so
    261             # it's ok.
--> 262             align_to = self.model.config.inputs_to_logits_ratio
    263             chunk_len = int(round(chunk_length_s * self.feature_extractor.sampling_rate / align_to) * align_to)
    264             stride_left = int(round(stride_length_s[0] * self.feature_extractor.sampling_rate / align_to) * align_to)

~/venv38/lib/python3.8/site-packages/transformers/configuration_utils.py in __getattribute__(self, key)
    252         if key != "attribute_map" and key in super().__getattribute__("attribute_map"):
    253             key = super().__getattribute__("attribute_map")[key]
--> 254         return super().__getattribute__(key)
    255 
    256     def __init__(self, **kwargs):

AttributeError: 'WhisperConfig' object has no attribute 'inputs_to_logits_ratio'
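
The failing lookup can be reproduced in isolation. A minimal check (my own sketch, assuming the same transformers commit; Wav2Vec2Config is used here only as a CTC counterexample):

from transformers import Wav2Vec2Config, WhisperConfig

# CTC-style configs expose the ratio needed to align chunk boundaries to
# frame-level logits; Whisper's seq2seq config defines no such attribute.
print(hasattr(Wav2Vec2Config(), "inputs_to_logits_ratio"))  # True
print(hasattr(WhisperConfig(), "inputs_to_logits_ratio"))   # False -> the AttributeError above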

Expected behavior

I would’ve expected to obtain the transcript in output.
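
From the traceback, the failing code path is only entered when chunk_length_s is passed. As a stopgap (my assumption, not a maintainer-confirmed workaround), short clips can be transcribed by dropping the chunking arguments, since Whisper’s feature extractor pads/truncates inputs to its native 30-second window anyway:

from datasets import load_dataset
from transformers import pipeline

pipe = pipeline(model="openai/whisper-large")
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# Without chunk_length_s/stride_length_s, preprocess() never reads
# inputs_to_logits_ratio, so the AttributeError is not triggered.
output = pipe(ds[0]['file'])
print(output['text'])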

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 14 (11 by maintainers)

Top GitHub Comments

3 reactions
ArthurZucker commented, Oct 11, 2022

Really sorry about my miscommunication. The chunking that will be supported is different from CTC. Let’s organize a call to speak in more detail about that 😉 The goal would be to be able to specify a chunk length and stride length (if people want to customize them), but by default Whisper has its own parameters. Let’s talk more about that when we call 🤗
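
For illustration, the interface described above might look like the following sketch (hypothetical; the argument names mirror the existing CTC pipeline and the final design may differ):

# Default: rely on Whisper's own parameters (its native 30-second window).
output = pipe('long_audio.wav')  # placeholder path, not from the issue

# Customized: user-specified chunk and stride lengths, as described above.
output = pipe('long_audio.wav', chunk_length_s=30, stride_length_s=(4, 2))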

3 reactions
Narsil commented, Oct 11, 2022

@ArthurZucker are we sure Whisper can handle chunking?

Whisper is not a CTC model, meaning that chunking as shown in Nico’s blog does not work.

from internal conversation.

Happy to jump into a design call to discuss whether we can do it or not.

Not being CTC means it’s harder to handle the boundaries. Boundaries at silence are sort of OK, but unfortunately that can never really be a complete solution (because you can never be sure you’re going to get a silence, and you MUST be able to handle chunking regardless). This might be deemed acceptable in Whisper, btw, but when we checked for regular models, the regular silence detection was not good enough to be run automatically (meaning you always have to tune settings to get decent silence results with most silence detectors).
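
For context, the CTC-style chunking being contrasted here works roughly as in the sketch below (illustrative only, not the pipeline’s actual internals): the audio is split into overlapping windows, and at decode time the frame-level logits that fall inside each window’s left/right stride are dropped, so every frame is predicted with context on both sides. That dropping step is precisely what a seq2seq model like Whisper cannot do, because it emits tokens with no per-frame alignment to the input.

import numpy as np

def chunk_iter(audio: np.ndarray, chunk_len: int, stride_left: int, stride_right: int):
    # Advance by the non-overlapping core of each chunk.
    step = chunk_len - stride_left - stride_right
    for start in range(0, len(audio), step):
        chunk = audio[start : start + chunk_len]
        # Edge chunks keep their outer boundary: there is nothing to drop there.
        left = 0 if start == 0 else stride_left
        right = 0 if start + chunk_len >= len(audio) else stride_right
        yield chunk, (left, right)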


