
ASR Pipeline is not super user-friendly


Feature request

Firstly, thank you to @Narsil for developing the speech recognition pipeline - it’s incredibly helpful for running the full speech-to-text mapping in one call, pre- and post-processing included.

There are a couple of things that currently make the pipeline awkward to use with 🤗 Datasets. I’ll motivate them below with an example.

Motivation

Let’s take the example of evaluating a (dummy) Wav2Vec2 checkpoint on the (dummy) LibriSpeech ASR dataset:

from transformers import pipeline
from datasets import load_dataset

pipe = pipeline("automatic-speech-recognition", model="hf-internal-testing/tiny-random-wav2vec2")
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:10]")

Printing the first audio sample of the dataset:

print(dataset[0]["audio"])

Print Output:

{'path': '/home/sanchit_huggingface_co/.cache/huggingface/datasets/downloads/extracted/0393f71a8093c6541f95c89f60982213cf086569876e1195926741f097ad47fc/dev_clean/1272/128104/1272-128104-0000.flac', 
'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00042725, 0.00057983,
       0.0010376 ], dtype=float32), 
'sampling_rate': 16000}

So the audio samples are in the format {"path": str, "array": np.array, "sampling_rate": int}. The np audio array values are stored under the key “array”. This format is ubiquitous in 🤗 Datasets: all audio datasets use it.

However, pipeline expects the audio samples in the format {"sampling_rate": int, "raw": np.array}: https://github.com/huggingface/transformers/blob/0ee71188ff184ee5f8b70081665858301fe4afb1/src/transformers/pipelines/automatic_speech_recognition.py#L209-L211

This means we have to do some hacking around to get the audio samples into the right format for pipeline:

def predict(batch):
    audios = batch["audio"]
    # hacky renaming
    audios = [{"raw": sample["array"], "sampling_rate": sample["sampling_rate"]} for sample in audios]

    predictions = pipe(audios)

    # unpack and index predictions (List[Dict])
    batch["predictions"] = [pred["text"] for pred in predictions]
    return batch

And then apply the function to our dataset using the map method:

batch_size = 4

result_set = dataset.map(
    predict,
    batched=True,
    batch_size=batch_size,
    remove_columns=dataset.features.keys(),
)

If pipeline’s __call__ method was matched to Datasets’ audio features, we’d be able to use any audio dataset directly with pipeline (no hacky feature renaming):

def hypothetical_predict(batch):
    predictions = pipe(batch["audio"])
    batch["predictions"] = [pred["text"] for pred in predictions]
    return batch

This would be very nice for the user!

Furthermore, the outputs returned by pipeline are a list of dicts (List[Dict]): https://github.com/huggingface/transformers/blob/0ee71188ff184ee5f8b70081665858301fe4afb1/src/transformers/pipelines/automatic_speech_recognition.py#L477

This means we have to unpack and index them before we can use them for any downstream use (such as WER calculations).
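For concreteness, here’s roughly what that downstream step looks like today (a sketch, assuming the result_set from the map call above and that the reference transcriptions live under the dataset’s "text" column):

import evaluate

wer_metric = evaluate.load("wer")

# the unpacking already happened inside `predict` above; without it we would
# have to index pred["text"] on every element of the returned List[Dict] here
wer = wer_metric.compute(
    predictions=result_set["predictions"],
    references=dataset["text"],
)
print(f"WER: {wer:.3f}")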

It would be nice if pipeline returned a ModelOutput class. That way, we could index the text column directly from the returned object:

def hypothetical_predict(batch):
    batch["predictions"] = pipe(batch["audio"]).text
    return batch

IMO this is more intuitive for the user than renaming their audio column and then iterating over the returned list of dicts to get the predicted text.
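Concretely, the returned object could look something like the following (just a sketch - the class name and fields are hypothetical, not an existing transformers class):

from dataclasses import dataclass
from typing import List, Optional

from transformers.utils import ModelOutput

# hypothetical output class, purely illustrative of the proposal
@dataclass
class AutomaticSpeechRecognitionOutput(ModelOutput):
    text: Optional[List[str]] = None
    # timestamp chunks could also live here when return_timestamps is enabled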

Your contribution

WDYT @Narsil @patrickvonplaten? Happy to add these changes to smooth out the user experience!

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

2 reactions
Narsil commented, Nov 30, 2022

What is the big drawback of this?

This is already done, it’s a doc issue. And specifically for sanchit, datasets are using {"audio": {"sampling_rate": ..., "array": ...}} instead of the inner dict.
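In practice that means the inner dict can be passed straight to the pipeline, e.g. (untested here, and assuming a version of transformers that accepts the “array” key):

# untested sketch: pass the inner audio dict straight through,
# rather than the outer {"audio": {...}} wrapper that datasets returns
print(pipe(dataset[0]["audio"]))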

The current generate method is a) too bloated and b) just not adapted for speech recognition.

True, I have potential suggestions for it, which mainly amount to going the full Processor/StoppingCriteria route. This is what was necessary to enable complex batching within BLOOM inference. Splitting specifically for audio might be necessary, but I am under the impression it’s only a matter of defaults for those objects.

1 reaction
sanchit-gandhi commented, Nov 25, 2022

Thank you again for the super comprehensive reply, really appreciate the time given to answering this thread!

make ML accessible for users who have no clue what is a model or tensors

Awesome! I think it’s fantastic in this regard. Having some easy examples that show how to run pipeline across different scenarios / tasks, like a little ‘recipe’ book, would be great to further this.

otherwise it would not be written in python, and it would not be that convenient 😃

Did someone say Rust 👀

Thanks for linking the tutorials - I learnt quite a lot from this thread + docs once I knew where to look. I guess you have two camps of people who will be using pipeline:

  1. Those migrating from the transformers approach (feature extractor + model + processor)
  2. Those who don’t use transformers

For me, it was making the link between my transformers approach and pipeline that made the penny drop. There’s a bit of a different mindset which you have to adopt vs the usual datasets .map method. I think some more examples showing how to make actual transformers tasks work in pipeline would go a long way! In this regard, your updated tutorial looks amazing (doing exactly this)! Happy to do a pass of the PR when it’s in a review ready state!

Would that work? (I haven’t tested this)

It did indeed work, thanks 🙌
