wav2vec2: Speech-to-text conversion fails when using a large file

I’m trying to transcribe an English speech recording (WAV file, 1.2 GB), but I get an error related to the size of the input vector. The same code works when the file is small.

Environment info

transformers version: 4.17.0
Platform: macOS-10.16-x86_64-i386-64bit
Python version: 3.8.12
PyTorch version (GPU?): 1.11.0 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: no
Using distributed or parallel set-up in script?: no

Who can help

Models: - Wav2Vec2: @patrickvonplaten, @anton-l

Information

Model I am using (Bert, XLNet …): Wav2Vec2

The problem arises when using:

the official example scripts: (give details below)
my own modified scripts: (give details below)

The tasks I am working on is:

an official GLUE/SQUaD task: (give the name)
my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Download an English speech from YouTube as an MP3.
Convert the MP3 to WAV (the resulting file is almost 1.2 GB).
Use wav2vec2 to convert the speech to text.
The model fails, apparently because of the file length or the frame rate. Is there any limitation on these values?

Further details of the .wav files

Metadata of the .wav file for which the model works :

Framerate :  8000
Channel info :  1
Bytes/sample :  2
Maximum amplitude :  32767
Length of audio :  51028

Metadata of the .wav file for which the model doesn’t work :

Framerate :  44100
Channel info :  2
Bytes/sample :  2
Maximum amplitude :  32768
Length of audio :  5901073
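
The header fields listed above can be read with Python's standard `wave` module. The following is a minimal sketch, not the script used in the issue; the function name `describe_wav` and the dictionary keys are hypothetical and chosen only for illustration.

```python
import wave


def describe_wav(path):
    """Return basic header fields of a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        return {
            "framerate": wav.getframerate(),         # samples per second
            "channels": wav.getnchannels(),          # 1 = mono, 2 = stereo
            "bytes_per_sample": wav.getsampwidth(),  # 2 means 16-bit PCM
            "n_frames": wav.getnframes(),            # frames per channel
        }
```

Checking the `channels` field before passing audio to the model is a cheap way to catch stereo input early.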

Stack Trace :

Traceback (most recent call last):
  File "main.py", line 59, in <module>
    logits = model(input_values)["logits"]
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1751, in forward
    outputs = self.wav2vec2(
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1347, in forward
    extract_features = self.feature_extractor(input_values)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 515, in forward
    hidden_states = conv_layer(hidden_states)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 415, in forward
    hidden_states = self.conv(hidden_states)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 302, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 298, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 2, 94417166]

Expected behavior

The script should print the text transcribed from the audio.

Top GitHub Comments

anton-l commented, Apr 4, 2022

Hello @iamshreeram!

The error actually refers to the shape of the input: it should be (batch_size, 1, sequence_length), meaning that the inputs have to be single-channel arrays (mono audio) while your second file has 2 channels (stereo audio).
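
As an illustration of the mono requirement, here is a minimal sketch of a stereo-to-mono downmix using only the standard library. It assumes 16-bit PCM samples interleaved as left, right, left, right, and so on; the helper name `stereo_to_mono` is hypothetical and is not part of transformers.

```python
from array import array


def stereo_to_mono(interleaved):
    """Average each left/right pair of interleaved 16-bit samples (hypothetical helper)."""
    if len(interleaved) % 2 != 0:
        raise ValueError("expected an even number of interleaved samples")
    mono = array("h")  # signed 16-bit integers
    for i in range(0, len(interleaved), 2):
        # The floor of the average of two int16 values always fits in int16.
        mono.append((interleaved[i] + interleaved[i + 1]) // 2)
    return mono
```

A pure-Python loop like this would be slow on a file of this size; averaging over the channel axis of a NumPy array is the more practical choice, and the loop is only meant to make the operation explicit.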

In any case, the model itself won’t be able to run on the whole 1.2 GB file, since it’s too long for a single batch (you’ll see a memory error). For such long files we have an ASR pipeline with chunked inference, which you can learn about in this tutorial (https://huggingface.co/blog/asr-chunking). The pipeline will handle stereo-to-mono conversion too, so you’ll just have to specify an input filename.

Let me know if it works for you 😃
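
The chunked-inference approach from the comment above could look roughly like the sketch below. It rests on several assumptions: the wrapper `transcribe_long_file` is hypothetical, the checkpoint `facebook/wav2vec2-base-960h` is only an example (the issue does not say which checkpoint was used), and the chunk and stride values are the ones shown in the linked tutorial. Passing a filename to the pipeline also requires ffmpeg to be installed.

```python
def transcribe_long_file(path, model_name="facebook/wav2vec2-base-960h",
                         chunk_length_s=10, stride_length_s=(4, 2)):
    """Transcribe a long audio file with chunked inference (hypothetical wrapper)."""
    # Imported lazily so that defining this function does not require transformers.
    from transformers import pipeline

    # model_name is an assumed example checkpoint, not the one from the issue.
    asr = pipeline("automatic-speech-recognition", model=model_name)
    # The audio is processed in chunks of chunk_length_s seconds; the stride gives
    # the overlap (left, right) in seconds between neighbouring chunks.
    result = asr(path, chunk_length_s=chunk_length_s, stride_length_s=stride_length_s)
    return result["text"]


# Example call (downloads the model on first use):
# print(transcribe_long_file("speech.wav"))
```

Because each chunk is only a few seconds long, memory use stays bounded regardless of the total file length.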

github-actions[bot] commented, May 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
