
can't allocate memory error with wav2vec2

See original GitHub issue

I am trying out the wav2vec2 model for ASR from the Hugging Face transformers library. I am passing a 7-minute (~15 MB) wav file containing an English conversation to the wav2vec2 model, and I get a “can’t allocate memory” error. I found that the model uses all 64 GB of the available RAM. Can anyone help with this?

  • transformers version: 4.3.2
  • Platform: Linux-3.10.0-1127.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.3
  • PyTorch version (GPU?): 1.7.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: (NA)
  • Using distributed or parallel set-up in script?: (NA)

Code

import os

import librosa
import nltk
import soundfile as sf
import torch
from pydub import AudioSegment
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

def convert_audio_segment(fp, upload_dir_path):
    """Convert an uploaded audio file to wav if needed"""

    USER_UPLOAD_DIR = upload_dir_path
    formats_to_convert = ['.m4a']
    dirpath = os.path.abspath(USER_UPLOAD_DIR)

    if fp.endswith(tuple(formats_to_convert)):
        (path, file_extension) = os.path.splitext(fp)
        file_extension_final = file_extension.replace('.', '')
        file_handle = ''

        try:
            track = AudioSegment.from_file(fp, file_extension_final)
            print("track", track)
            wav_path = fp.replace(file_extension_final, 'wav')
            file_handle = track.export(wav_path, format='wav')
        except Exception:
            print("ERROR CONVERTING " + str(fp))
        return file_handle
    else:
        print("No file format conversion required " + str(fp))
        return fp

def load_wav2vec_100h_model():
    tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-100h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-100h")    
    return tokenizer, model

def correct_sentence(input_text):
    # Capitalize the first character of each sentence
    sentences = nltk.sent_tokenize(input_text)
    return ' '.join([s.replace(s[0], s[0].capitalize(), 1) for s in sentences])

def asr_transcript(tokenizer, model, input_file):

    speech, fs = sf.read(input_file)

    # Fold stereo down to one channel (sum of the two channels)
    if len(speech.shape) > 1:
        speech = speech[:, 0] + speech[:, 1]

    # The model expects 16 kHz input
    if fs != 16000:
        speech = librosa.resample(speech, fs, 16000)

    input_values = tokenizer(speech, return_tensors="pt").input_values
    logits = model(input_values).logits

    predicted_ids = torch.argmax(logits, dim=-1)

    transcription = tokenizer.decode(predicted_ids[0])

    return correct_sentence(transcription.lower())

if __name__ == "__main__":

    data_dir = "."  # upload directory; this variable was undefined in the original post
    tokenizer_100h, model_100h = load_wav2vec_100h_model()
    wav_input = 'Recording_biweu.wav'
    fp = wav_input

    processed_file = convert_audio_segment(str(fp), str(data_dir))
    text = asr_transcript(tokenizer_100h, model_100h, processed_file)
    print(text)

More details about my wav file:

General
Complete name                            : Recording_biweu.wav
Format                                   : Wave
File size                                : 13.8 MiB
Duration                                 : 7 min 30 s
Overall bit rate mode                    : Constant
Overall bit rate                         : 256 kb/s
Track name                               : Recording_biweu
Recorded date                            : 2021
Writing application                      : Lavf57.83.100

Audio
Format                                   : PCM
Format settings                          : Little / Signed
Codec ID                                 : 1
Duration                                 : 7 min 30 s
Bit rate mode                            : Constant
Bit rate                                 : 256 kb/s
Channel(s)                               : 1 channel
Sampling rate                            : 16.0 kHz
Bit depth                                : 16 bits
Stream size                              : 13.8 MiB (100%)

Error

Some weights of the model checkpoint at facebook/wav2vec2-base-100h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.mask_time_emb_vector']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "asr_wav2vec2.py", line 130, in <module>
    text = asr_transcript(tokenizer_100h,model_100h,processed_file)
  File "asr_wav2vec2.py", line 96, in asr_transcript
    logits = model(input_values).logits
  File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 795, in forward
    outputs = self.wav2vec2(
  File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 646, in forward
    encoder_outputs = self.encoder(
  File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 457, in forward
    hidden_states, attn_weights = layer(hidden_states, output_attentions=output_attentions)
  File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 392, in forward
    hidden_states, attn_weights, _ = self.attention(hidden_states, output_attentions=output_attentions)
  File "/home/joel/pyvenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/joel/pyvenv/lib/python3.8/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 286, in forward
    attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 24373495488 bytes. Error code 12 (Cannot allocate memory)
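
The size of the failed allocation lines up with self-attention scaling quadratically in sequence length. A back-of-the-envelope check (a sketch assuming the wav2vec2-base configuration: a convolutional feature extractor with a total stride of 320 samples, 12 attention heads, float32 activations):

# Why the traceback asks for ~24 GB: the attention weights tensor in
# torch.bmm has shape (heads, seq_len, seq_len). Values below assume
# the wav2vec2-base config (stride 320, 12 heads) and 4-byte float32.
sample_rate = 16000
duration_s = 7 * 60 + 30                 # 7 min 30 s recording
n_samples = sample_rate * duration_s     # 7,200,000 samples
seq_len = n_samples // 320               # ~22,500 encoder frames

attn_bytes = 12 * seq_len * seq_len * 4  # heads * seq^2 * bytes per float
print(attn_bytes / 1e9)                  # ~24.3 GB for one attention matrix

That is roughly the 24,373,495,488 bytes requested above, and it is the cost of a single layer’s attention weights, which is why chunking the input (as suggested in the comments below) is the practical fix.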

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

9 reactions

LysandreJik commented on Feb 26, 2021

Okay, so the issue isn’t in the number of samples as I thought previously: there seems to be a single audio stream in your recording.

However, the recording is 7 minutes and 30 seconds long, which is very long. I talked about it with @patrickvonplaten, and he mentions that Wav2Vec2 was trained on recordings of at most ~40 seconds. What one could do here is split the recording into 30-second chunks. You’re already using librosa, so you can do that easily with librosa.stream.

For example, your current method to retrieve the transcript is the following:

def asr_transcript(tokenizer, model, input_file):

    speech, fs = sf.read(input_file)

    if len(speech.shape) > 1:
        speech = speech[:, 0] + speech[:, 1]

    if fs != 16000:
        speech = librosa.resample(speech, fs, 16000)

    input_values = tokenizer(speech, return_tensors="pt").input_values
    logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = tokenizer.decode(predicted_ids[0])

    return correct_sentence(transcription.lower())

I’ve updated it to the following (please note that this is the first time I’ve used librosa myself, so the parameters I put in for the stream values may be wrong):

def asr_transcript(tokenizer, model, input_file):
    transcript = ""
    # Ensure that the sample rate is 16k
    print(librosa.get_samplerate(input_file))

    # Stream over 30-second chunks rather than loading the full file.
    # With frame_length == hop_length the frames don't overlap, so each
    # block is block_length * frame_length = 480,000 samples, i.e. 30 s
    # of audio at 16 kHz.
    stream = librosa.stream(
        input_file,
        block_length=30,
        frame_length=16000,
        hop_length=16000
    )

    for speech in stream:
        if len(speech.shape) > 1:
            speech = speech[:, 0] + speech[:, 1]

        input_values = tokenizer(speech, return_tensors="pt").input_values
        logits = model(input_values).logits

        predicted_ids = torch.argmax(logits, dim=-1)
        transcription = tokenizer.decode(predicted_ids[0])
        transcript += correct_sentence(transcription.lower())

    return transcript

With this I seem to obtain sensible results! This could probably be improved in the following ways:

  • Ensure that the parameters passed to librosa.stream are correct. Changing them seems to have a very big impact on the transcript.
  • Patrick mentions that an advanced solution would be to use a voice activity detector to see where there is no speech and chunk there, for example by finding a sequence of 100 values very close to zero and cutting at that point. Little accuracy would be lost then; a sketch of this idea follows below.
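
A minimal sketch of that silence-based chunking idea (not from the thread itself; the helper names, the 1e-4 amplitude threshold, and the 100-sample run length are illustrative assumptions, not tuned values):

import numpy as np

def find_silence_splits(speech, threshold=1e-4, min_run=100):
    # Indices at the middle of runs of >= min_run near-zero samples;
    # threshold and min_run are illustrative assumptions.
    silent = np.abs(speech) < threshold
    splits, run_start = [], None
    for i, s in enumerate(silent):
        if s and run_start is None:
            run_start = i
        elif not s and run_start is not None:
            if i - run_start >= min_run:
                splits.append((run_start + i) // 2)  # cut mid-silence
            run_start = None
    return splits

def chunk_on_silence(speech, max_len=30 * 16000):
    # Greedily emit chunks of at most max_len samples, preferring to
    # cut inside a silent region rather than mid-word.
    splits = find_silence_splits(speech)
    chunks, start = [], 0
    while start < len(speech):
        end = min(start + max_len, len(speech))
        candidates = [s for s in splits if start < s < end]
        if candidates and end < len(speech):
            end = candidates[-1]  # last silence before the hard limit
        chunks.append(speech[start:end])
        start = end
    return chunks

Each chunk can then go through the same tokenizer/model/decode loop as in the streaming version above.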
0 reactions

github-actions[bot] commented on Apr 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
