ASR Pipeline is not super user-friendly
Feature request
Firstly, thank you to @Narsil for developing the speech recognition pipeline - it’s incredibly helpful for running the full speech-to-text mapping in one call, pre- and post-processing included.
There are a couple of things that currently mean the pipeline is not super compatible with 🤗 Datasets. I’ll motivate them below with an example.
Motivation
Let’s take the example of evaluating a (dummy) Wav2Vec2 checkpoint on the (dummy) LibriSpeech ASR dataset:
from transformers import pipeline
from datasets import load_dataset
pipe = pipeline("automatic-speech-recognition", model="hf-internal-testing/tiny-random-wav2vec2")
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]")
Printing the first audio sample of the dataset:
print(dataset[0]["audio"])
Print Output:
{'path': '/home/sanchit_huggingface_co/.cache/huggingface/datasets/downloads/extracted/0393f71a8093c6541f95c89f60982213cf086569876e1195926741f097ad47fc/dev_clean/1272/128104/1272-128104-0000.flac',
'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00042725, 0.00057983,
0.0010376 ], dtype=float32),
'sampling_rate': 16000}
So the audio samples are in the format {"path": str, "array": np.array, "sampling_rate": int}. The NumPy audio array values are stored under the key "array". This format is ubiquitous across 🤗 Datasets: all audio datasets take this format.
However, pipeline expects the audio samples in the format {"sampling_rate": int, "raw": np.array}:
https://github.com/huggingface/transformers/blob/0ee71188ff184ee5f8b70081665858301fe4afb1/src/transformers/pipelines/automatic_speech_recognition.py#L209-L211
This means we have to do some hacking around to get the audio samples into the right format for pipeline:
def predict(batch):
audios = batch["audio"]
# hacky renaming
audios = [{"raw": sample["array"], "sampling_rate": sample["sampling_rate"]} for sample in audios]
predictions = pipe(audios)
# unpack and index predictions (List[Dict])
batch["predictions"] = [pred["text"] for pred in predictions]
return batch
And then apply the function to our dataset using the map method:
batch_size = 4
result_set = dataset.map(
predict,
batched=True,
batch_size=batch_size,
remove_columns=dataset.features.keys(),
)
If pipeline’s __call__ method were matched to Datasets’ audio features, we’d be able to use any audio dataset directly with pipeline (no hacky feature renaming):
def hypothetical_predict(batch):
    predictions = pipe(batch["audio"])
    batch["predictions"] = [pred["text"] for pred in predictions]
    return batch
This would be very nice for the user!
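One way the pipeline could support both formats is a small normalization step on the input dict. A rough sketch (a hypothetical `normalize_audio_input` helper, not actual transformers code), assuming the two dict formats shown above:

```python
def normalize_audio_input(inputs: dict) -> dict:
    """Accept either the 🤗 Datasets audio format ("array") or the
    pipeline format ("raw") and normalize to the pipeline's keys.
    Hypothetical sketch, not part of transformers."""
    inputs = dict(inputs)  # avoid mutating the caller's dict
    if "array" in inputs and "raw" not in inputs:
        inputs["raw"] = inputs.pop("array")
    inputs.pop("path", None)  # the pipeline does not need the file path
    return inputs
```

With a step like this at the top of `__call__`, the same dataset sample could be passed in either format without any user-side renaming.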
Furthermore, the outputs returned by pipeline are a list of dicts (List[Dict]):
https://github.com/huggingface/transformers/blob/0ee71188ff184ee5f8b70081665858301fe4afb1/src/transformers/pipelines/automatic_speech_recognition.py#L477
This means we have to unpack and index them before we can use them for any downstream use (such as WER calculations).
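The flat list of prediction strings obtained after unpacking is exactly what a metric like WER consumes. As a self-contained illustration (plain-Python edit distance, rather than the `evaluate`/`jiwer` libraries one would normally use):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution / match
            )
    return dp[-1][-1] / len(ref)
```

For example, `wer("the cat sat", "the cat")` gives 1/3 (one deleted word out of three reference words).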
It would be nice if pipeline returned a ModelOutput class. That way, we could index the text column directly from the returned object:
def hypothetical_predict(batch):
batch["predictions"] = pipe(batch["audio"]).text
return batch
IMO this is more intuitive to the user than renaming their audio column and then iterating over the returned Dict object to get the predicted text.
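For illustration, such a return type could look roughly like the following (a hypothetical `ASRPipelineOutput` sketch, not the actual transformers ModelOutput class):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ASRPipelineOutput:
    """Hypothetical batch output exposing transcriptions as an attribute."""
    text: List[str] = field(default_factory=list)

# usage sketch: attribute access instead of list-of-dict unpacking
outputs = ASRPipelineOutput(text=["hello world", "foo bar"])
predictions = outputs.text  # ["hello world", "foo bar"]
```

Attribute access would then compose directly with `Dataset.map`, as in the `hypothetical_predict` example above.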
Your contribution
WDYT @Narsil @patrickvonplaten? Happy to add these changes to smooth out the user experience!
Issue Analytics
- State:
- Created 10 months ago
- Comments: 8 (8 by maintainers)
Top GitHub Comments
This is already done, it’s a doc issue. And specifically for sanchit, datasets are using
{"audio" : {"sampling_rate": .., "audio": ..}}
instead of the inner dict.

True, I have potential suggestions for it, which mainly are going full on the Processor/StoppingCriteria route. This is what was necessary to enable complex batching within bloom inference. Splitting specifically for audio might be necessary, but I am under the impression it’s only a matter of defaults for those objects.
Thank you again for the super comprehensive reply, really appreciate the time given to answering this thread!
Awesome! Think it’s fantastic in this regard. Having some easy examples that show you how to run pipeline in different scenarios / tasks like a little ‘recipe’ book would be great to further this.
Did someone say Rust 👀
Thanks for linking the tutorials - I learnt quite a lot from this thread + docs after knowing where to look. I guess you have two camps of people that will be using pipeline.
For me, it was making the link between my transformers approach and pipeline that made the penny drop. There’s a bit of a different mindset which you have to adopt vs the usual datasets.map method. I think some more examples showing how to make actual transformers tasks work in pipeline would go a long way! In this regard, your updated tutorial looks amazing (doing exactly this)! Happy to do a pass of the PR when it’s in a review-ready state!

It did indeed work, thanks 🙌