
wav2vec2 : Speech to text conversion fails when using large file

See original GitHub issue

I’m trying to transcribe an English speech recording (WAV file, 1.2 GB), but I’m seeing an error that appears related to the input vector size. The same code works when the file is smaller.

Environment info

  • transformers version: 4.17.0
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.8.12
  • PyTorch version (GPU?): 1.11.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Models: - Wav2Vec2: @patrickvonplaten, @anton-l

Information

Model I am using (Bert, XLNet …): Wav2Vec2

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Download an English speech recording from YouTube as an MP3
  2. Convert the MP3 to WAV (roughly 1.2 GB file size)
  3. Use wav2vec2 to transcribe the speech to text
  4. The model fails, apparently due to the file length or frame rate. Is there any limitation on the size of these input vectors?
  5. Further details of the .wav files are below

Metadata of the .wav file for which the model works :

Framerate :  8000
Channel info :  1
Bytes/sample :  2
Maximum amplitude :  32767
Length of audio :  51028

Metadata of the .wav file for which the model doesn’t work :

Framerate :  44100
Channel info :  2
Bytes/sample :  2
Maximum amplitude :  32768
Length of audio :  5901073
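For reference, metadata like the above can be read with Python’s standard-library `wave` module. A minimal sketch (the function name is my own; pass it the path to your WAV file):

```python
import wave

def wav_metadata(path):
    """Return basic metadata for a WAV file at `path`."""
    with wave.open(path, "rb") as wav:
        return {
            "framerate": wav.getframerate(),        # samples per second
            "channels": wav.getnchannels(),         # 1 = mono, 2 = stereo
            "bytes_per_sample": wav.getsampwidth(),
            "n_frames": wav.getnframes(),           # length in frames
        }
```

A quick check of the channel count with this helper already reveals the key difference between the two files above: the working file is mono, the failing one is stereo.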

Stack Trace :

Traceback (most recent call last):
  File "main.py", line 59, in <module>
    logits = model(input_values)["logits"]
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1751, in forward
    outputs = self.wav2vec2(
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1347, in forward
    extract_features = self.feature_extractor(input_values)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 515, in forward
    hidden_states = conv_layer(hidden_states)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 415, in forward
    hidden_states = self.conv(hidden_states)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 302, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 298, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 2, 94417166]

Expected behavior

The text transcribed from the audio should be printed.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:7 (4 by maintainers)

Top GitHub Comments

1 reaction
anton-l commented, Apr 4, 2022

Hello @iamshreeram!

The error actually refers to the shape of the input: it should be (batch_size, 1, sequence_length), meaning that the inputs have to be single-channel arrays (mono audio) while your second file has 2 channels (stereo audio).
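A minimal sketch of that down-mix with NumPy (assuming `audio` is a `(num_frames, 2)` float array, as e.g. soundfile returns): averaging the two channels yields the 1-D mono signal the model expects.

```python
import numpy as np

def stereo_to_mono(audio: np.ndarray) -> np.ndarray:
    """Down-mix a (num_frames, 2) stereo array to a 1-D mono array."""
    if audio.ndim == 2 and audio.shape[1] == 2:
        return audio.mean(axis=1)  # average left and right channels
    return audio  # already mono

# Example: 4 stereo frames become 4 mono samples.
stereo = np.array([[1.0, 3.0], [2.0, 2.0], [0.0, 4.0], [5.0, 5.0]])
mono = stereo_to_mono(stereo)
print(mono)  # [2. 2. 2. 5.], shape (4,)
```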

In any case, the model itself won’t be able to run on the whole 1.2 GB file, since it’s too long for a single batch (you’ll see a memory error). For such long files we have an ASR pipeline with chunked inference, which you can learn about in this tutorial: https://huggingface.co/blog/asr-chunking. The pipeline will handle stereo-to-mono conversion too, so you’ll just have to specify an input filename.
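Conceptually, chunked inference slides an overlapping window over the long waveform, transcribes each chunk, and stitches the results; the overlap gives the model context at the chunk edges. A rough sketch of just the windowing step (illustrative only, not the pipeline’s actual implementation):

```python
def chunk_with_stride(samples, chunk_len, stride):
    """Split a long sample sequence into overlapping chunks.

    Consecutive chunks overlap by `stride` samples, so the model sees
    enough context on both sides of each chunk boundary.
    """
    step = chunk_len - stride
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_len])
        if start + chunk_len >= len(samples):
            break
    return chunks

# 10 samples, chunks of 4 with an overlap of 2 -> windows start at 0, 2, 4, 6
print(chunk_with_stride(list(range(10)), chunk_len=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

In practice you don’t implement this yourself: the transformers `pipeline("automatic-speech-recognition", ..., chunk_length_s=...)` handles the windowing and stitching for you, as the linked tutorial describes.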

Let me know if it works for you 😃

0 reactions
github-actions[bot] commented, May 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
