wav2vec2 : Speech to text conversion fails when using large file
I’m trying to transcribe English speech from a WAV file (1.2 GB) to text, but I get an error related to the size of the input tensor. The same code works when the file is small.
Environment info
- transformers version: 4.17.0
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.12
- PyTorch version (GPU?): 1.11.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help
Models: - Wav2Vec2: @patrickvonplaten, @anton-l
Information
Model I am using (Bert, XLNet …): Wav2Vec2
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Download an English speech from YouTube as an mp3
- Convert the mp3 to wav (the resulting file is almost 1.2 GB)
- Use wav2vec2 to transcribe the speech to text
- The model fails, apparently because of the file length or the frame rate. Is there any limitation on these values?
- Further details of the .wav files are given below
Metadata of the .wav file for which the model works:
Framerate : 8000
Channel info : 1
Bytes/sample : 2
Maximum amplitude : 32767
Length of audio : 51028
Metadata of the .wav file for which the model doesn’t work:
Framerate : 44100
Channel info : 2
Bytes/sample : 2
Maximum amplitude : 32768
Length of audio : 5901073
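For readers who want to collect the same header fields for their own files, here is a minimal sketch using Python's standard `wave` module. The helper names `describe_wav` and `is_wav2vec2_ready` are hypothetical, and the 16000 Hz default is an assumption based on the sample rate that commonly used Wav2Vec2 checkpoints were trained on; check the model card of the checkpoint you actually use.

```python
import wave


def describe_wav(path):
    """Return basic header fields of a PCM .wav file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        return {
            "framerate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "bytes_per_sample": wav.getsampwidth(),
            "frames": wav.getnframes(),
        }


def is_wav2vec2_ready(info, expected_rate=16000):
    """Check for mono audio at the expected sample rate.

    expected_rate=16000 is an assumption; it is not stated in the issue.
    """
    return info["channels"] == 1 and info["framerate"] == expected_rate
```

With the second file above, `describe_wav` would report 2 channels at 44100 Hz, so `is_wav2vec2_ready` would return False before the model is ever called.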
Stack Trace :
Traceback (most recent call last):
File "main.py", line 59, in <module>
logits = model(input_values)["logits"]
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1751, in forward
outputs = self.wav2vec2(
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1347, in forward
extract_features = self.feature_extractor(input_values)
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 515, in forward
hidden_states = conv_layer(hidden_states)
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 415, in forward
hidden_states = self.conv(hidden_states)
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 302, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/Applications/anaconda3/envs/speechml/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 298, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 2, 94417166]
Expected behavior
To print the text that is converted from the audio
Issue Analytics
- State:
- Created a year ago
- Comments:7 (4 by maintainers)
Hello @iamshreeram!
The error actually refers to the shape of the input: it should be (batch_size, 1, sequence_length), meaning that the inputs have to be single-channel arrays (mono audio), while your second file has 2 channels (stereo audio).
In any case, the model itself won’t be able to run on the whole 1.2 GB file, since it’s too long for a single batch (you’ll see a memory error). For such long files we have an ASR pipeline with chunked inference, which you can learn about in this tutorial: https://huggingface.co/blog/asr-chunking The pipeline will handle stereo-to-mono conversion too, so you’ll just have to specify an input filename.
Let me know if it works for you 😃
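As a rough illustration of the two points in the reply above, the sketch below averages a stereo array down to mono and runs the ASR pipeline with chunking. It is not the code from this thread: `to_mono` and `transcribe_long_file` are hypothetical names, the checkpoint name and the chunk and stride values are assumptions borrowed from the linked tutorial, and the array is assumed to be laid out as (num_frames, num_channels). Passing a filename to the pipeline also assumes ffmpeg is installed.

```python
import numpy as np


def to_mono(samples):
    """Average the channels of a (num_frames, num_channels) array into 1-D mono.

    The (frames, channels) layout is an assumption; some loaders return
    (channels, frames) instead, in which case transpose first.
    """
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim == 1:
        return samples  # already mono
    return samples.mean(axis=1)


def transcribe_long_file(
    path,
    model_name="facebook/wav2vec2-base-960h",  # assumed checkpoint
    chunk_length_s=10,                         # assumed chunk size in seconds
    stride_length_s=(4, 2),                    # assumed left/right overlap in seconds
):
    """Transcribe a long audio file in overlapping chunks (sketch only)."""
    # Imported lazily so the mono helper can be used without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_name)
    result = asr(path, chunk_length_s=chunk_length_s, stride_length_s=stride_length_s)
    return result["text"]
```

Calling `transcribe_long_file("speech.wav")` (a placeholder filename) would print nothing by itself; it returns the transcript as a string. The manual `to_mono` step is only needed when calling the model directly, since the reply states that the pipeline handles the stereo-to-mono conversion.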
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.