ASR pipeline does not work with openai/whisper on current master
### System Info
- transformers @ git+https://github.com/huggingface/transformers.git@b651efe59ea506d38173e3a60a4228e7e74719f9
- Python 3.6
- Standard AWS Ubuntu Deep Learning AMI (Ubuntu 18.04) Version 30.0
### Who can help?
@Narsil @anton-l @sanchit-gandhi @patrickvonplaten
### Information

- The official example scripts
- My own modified scripts

### Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
### Reproduction

To reproduce, run the following code (taken from the ASR pipeline example, using Whisper):
```python
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline(model="openai/whisper-large")
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
output = pipe(ds[0]['file'], chunk_length_s=30, stride_length_s=(4, 2))
```
yields:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-efceed64cd5c> in <module>
----> 1 output = pipe(ds[0]['file'], chunk_length_s=30, stride_length_s=(4, 2))

~/venv38/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py in __call__(self, inputs, **kwargs)
    181             `"".join(chunk["text"] for chunk in output["chunks"])`.
    182         """
--> 183         return super().__call__(inputs, **kwargs)
    184
    185     def _sanitize_parameters(self, **kwargs):

~/venv38/lib/python3.8/site-packages/transformers/pipelines/base.py in __call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1072             return self.iterate(inputs, preprocess_params, forward_params, postprocess_params)
   1073         else:
-> 1074             return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
   1075
   1076     def run_multi(self, inputs, preprocess_params, forward_params, postprocess_params):

~/venv38/lib/python3.8/site-packages/transformers/pipelines/base.py in run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1093     def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
   1094         all_outputs = []
-> 1095         for model_inputs in self.preprocess(inputs, **preprocess_params):
   1096             model_outputs = self.forward(model_inputs, **forward_params)
   1097             all_outputs.append(model_outputs)

~/venv38/lib/python3.8/site-packages/transformers/pipelines/automatic_speech_recognition.py in preprocess(self, inputs, chunk_length_s, stride_length_s)
    260             # Currently chunking is not possible at this level for `seq2seq` so
    261             # it's ok.
--> 262             align_to = self.model.config.inputs_to_logits_ratio
    263             chunk_len = int(round(chunk_length_s * self.feature_extractor.sampling_rate / align_to) * align_to)
    264             stride_left = int(round(stride_length_s[0] * self.feature_extractor.sampling_rate / align_to) * align_to)

~/venv38/lib/python3.8/site-packages/transformers/configuration_utils.py in __getattribute__(self, key)
    252         if key != "attribute_map" and key in super().__getattribute__("attribute_map"):
    253             key = super().__getattribute__("attribute_map")[key]
--> 254         return super().__getattribute__(key)
    255
    256     def __init__(self, **kwargs):

AttributeError: 'WhisperConfig' object has no attribute 'inputs_to_logits_ratio'
```
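For context, the failing branch rounds the chunk and stride lengths to whole output frames using `model.config.inputs_to_logits_ratio`, which CTC model configs define but `WhisperConfig` does not. A minimal sketch of that arithmetic, with an illustrative ratio value borrowed from a typical CTC model (not anything Whisper defines):

```python
# Sketch of the alignment arithmetic from the traceback above.
# `align_to` stands in for model.config.inputs_to_logits_ratio;
# 320 is an illustrative CTC-style value, not a Whisper attribute.
sampling_rate = 16_000     # feature extractor sampling rate
align_to = 320             # hypothetical inputs_to_logits_ratio
chunk_length_s = 30
stride_length_s = (4, 2)

chunk_len = int(round(chunk_length_s * sampling_rate / align_to) * align_to)
stride_left = int(round(stride_length_s[0] * sampling_rate / align_to) * align_to)
stride_right = int(round(stride_length_s[1] * sampling_rate / align_to) * align_to)
print(chunk_len, stride_left, stride_right)  # 480000 64000 32000 samples
```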
### Expected behavior

I would have expected to obtain the transcript in `output`.
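A possible interim workaround (my assumption from the traceback, not confirmed by the maintainers): the attribute is only read in the chunking branch of `preprocess`, so dropping the chunking arguments should let the pipeline transcribe short clips in a single pass:

```python
# Assumed workaround: without chunk_length_s/stride_length_s, preprocess
# never touches inputs_to_logits_ratio. Only viable for short clips.
output = pipe(ds[0]['file'])
print(output["text"])
```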
### Top GitHub Comments
Really sorry about my miscommunication. The chunking that will be supported is different from CTC. Let's organize a call to discuss that in more detail 😉 The goal would be to let users specify a chunk length and stride length (if they want to customize them), but by default Whisper has its own parameters. Let's talk more about that when we call 🤗
> @ArthurZucker are we sure Whisper can handle chunking? (from an internal conversation)

Happy to jump into a design call to discuss whether we can do it or not.

Not being CTC means it's harder to handle the boundaries. Boundaries at silence are sort of OK, but unfortunately that can never really be a complete solution (because you can never be sure you're going to get a silence, and you MUST be able to handle chunking regardless). This might be deemed acceptable in Whisper, by the way, but when we checked for regular models, silence detection was not good enough to be run automatically (meaning you always have to tune settings to get decent results with most silence detectors).
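To illustrate why silence-based boundaries are fragile, here is a toy energy-threshold detector (an illustrative sketch, not the pipeline's actual code): nothing guarantees any window falls below the threshold, and the threshold itself needs per-recording tuning.

```python
import numpy as np

# Toy energy-based silence detector (illustrative only; assumes a mono
# float waveform). Returns sample offsets of windows whose mean energy
# is below `threshold`. On continuous speech it may return nothing at
# all, which is why chunking cannot rely on finding silence.
def silence_boundaries(audio: np.ndarray, sr: int,
                       win_s: float = 0.02, threshold: float = 1e-3) -> np.ndarray:
    win = int(win_s * sr)
    n = (len(audio) // win) * win
    frames = audio[:n].reshape(-1, win)
    energy = (frames ** 2).mean(axis=1)
    return np.flatnonzero(energy < threshold) * win
```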