IndexError: out of bounds

This wave file: pl.zip

This code:

import torch, transformers, ctc_segmentation
import soundfile

# wav2vec2
model_file = 'jonatasgrosman/wav2vec2-large-xlsr-53-polish'
vocab_dict = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4, "A": 5, "I": 6, "E": 7, "O": 8, "Z": 9, "N": 10, "S": 11, "W": 12, "R": 13, "C": 14, "Y": 15, "M": 16, "T": 17, "D": 18, "K": 19, "P": 20, "Ł": 21, "J": 22, "U": 23, "L": 24, "B": 25, "Ę": 26, "G": 27, "Ą": 28, "Ż": 29, "H": 30, "Ś": 31, "Ó": 32, "Ć": 33, "F": 34, "Ń": 35, "Ź": 36, "V": 37, "-": 38, "Q": 39, "X": 40, "'": 41}

processor = transformers.Wav2Vec2Processor.from_pretrained( model_file )
model = transformers.Wav2Vec2ForCTC.from_pretrained( model_file )

speech_array, sampling_rate = soundfile.read( '/tmp/pl.wav' )
assert sampling_rate == 16000
features = processor(speech_array,sampling_rate=16000, return_tensors="pt")
input_values = features.input_values
attention_mask = features.attention_mask
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
transcription = transcription.lower().split()

# ctc-segmentation
with torch.no_grad():
    softmax = torch.nn.LogSoftmax(dim=-1)
    lpz = softmax(logits)[0].cpu().numpy()
config = ctc_segmentation.CtcSegmentationParameters()
config.index_duration = speech_array.shape[0] / lpz.shape[0] / sampling_rate
char_list = [x.lower() for x in vocab_dict.keys()]
ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_text(config, transcription,char_list)
timings, char_probs, state_list = ctc_segmentation.ctc_segmentation(config, lpz, ground_truth_mat)
segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, transcription)

Console:

Traceback (most recent call last):
  File "ctc.py", line 31, in <module>
    segments = ctc_segmentation.determine_utterance_segments(config, utt_begin_indices, char_probs, timings, transcription)
  File "/home/max/.local/lib/python3.8/site-packages/ctc_segmentation/ctc_segmentation.py", line 387, in determine_utterance_segments
    start = compute_time(utt_begin_indices[i], "begin")
  File "/home/max/.local/lib/python3.8/site-packages/ctc_segmentation/ctc_segmentation.py", line 380, in compute_time
    return max(timings[index + 1] - 0.5, middle)
IndexError: index 450 is out of bounds for axis 0 with size 450

Top GitHub Comments

lumaku commented, Jun 8, 2021

So, three things come together here: the char list contains - as a character, the last utterance of the transcription consists of a single -, and - is also in the list of excluded characters. As a result, the last utterance is left out of ground_truth_mat, and it is then missing when the segments are determined.
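
To make the mechanism concrete, here is a minimal pure-Python sketch. It is a simplified model written for illustration, not the ctc_segmentation source: build_ground_truth, the space separator, and the excluded strings are all assumptions. It shows how an utterance made up only of excluded characters ends up with a begin index on the very last position of the ground truth, so that index + 1 falls outside an array of that length.

```python
# Simplified, hypothetical model of how a ground-truth string and the
# utterance begin indices could be assembled. NOT the library's code.

SPACE = " "  # assumed separator between utterances


def build_ground_truth(utterances, char_list, excluded):
    """Return (ground_truth, utt_begin_indices) for a list of utterance strings."""
    ground_truth = SPACE
    utt_begin_indices = []
    for utt in utterances:
        # Exactly one separator between utterances.
        if not ground_truth.endswith(SPACE):
            ground_truth += SPACE
        # Remember where this utterance starts.
        utt_begin_indices.append(len(ground_truth) - 1)
        # Keep only known characters that are not excluded.
        for char in utt:
            if char in char_list and char not in excluded:
                ground_truth += char
    if not ground_truth.endswith(SPACE):
        ground_truth += SPACE
    # The final entry marks the end of the last utterance.
    utt_begin_indices.append(len(ground_truth) - 1)
    return ground_truth, utt_begin_indices


char_list = ["a", "b", "-"]
utterances = ["ab", "ba", "-"]  # the last utterance is a single "-"

# "-" excluded: the last utterance contributes nothing to the ground truth.
gt_bad, idx_bad = build_ground_truth(utterances, char_list, excluded="-.,")
# "-" allowed: the last utterance gets its own span.
gt_ok, idx_ok = build_ground_truth(utterances, char_list, excluded=".,")

print(len(gt_bad), idx_bad)
print(len(gt_ok), idx_ok)
```

In the first case the begin index of the last utterance equals the last valid position, which mirrors the "index 450 is out of bounds for axis 0 with size 450" message in the traceback.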

You could solve this by updating the config object:

char_list = [x.lower() for x in vocab_dict.keys()]
config = ctc_segmentation.CtcSegmentationParameters(char_list=char_list) # note: char_list is set here instead of at prepare_text
config.update_excluded_characters()
config.index_duration = speech_array.shape[0] / lpz.shape[0] / sampling_rate
ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_text(config, transcription)
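
As I read the snippet above, setting char_list on the config and then updating the excluded characters takes model tokens such as - off the exclusion list. The sketch below imitates that idea in plain Python and adds a pre-check for utterances that would end up empty. drop_known_tokens, empty_after_filtering, and the excluded string are hypothetical names and values chosen for illustration, not library API or library defaults.

```python
def drop_known_tokens(excluded, char_list):
    """Keep only those excluded characters that are not model tokens."""
    return "".join(c for c in excluded if c not in char_list)


def empty_after_filtering(utterances, char_list, excluded):
    """Return the indices of utterances with no alignable character left."""
    return [
        i
        for i, utt in enumerate(utterances)
        if not any(c in char_list and c not in excluded for c in utt)
    ]


char_list = ["<pad>", "|", "a", "b", "-", "'"]
excluded = ".,-?!"  # hypothetical exclusion list that happens to contain "-"
utterances = ["ab", "ba", "-"]

# Before the update, the last utterance has nothing left to align.
before = empty_after_filtering(utterances, char_list, excluded)
# After dropping known tokens from the exclusion list, every utterance survives.
updated = drop_known_tokens(excluded, char_list)
after = empty_after_filtering(utterances, char_list, updated)

print(before, repr(updated), after)
```

A check like empty_after_filtering can be run on the transcription before segmentation to spot utterances that would otherwise trigger the IndexError.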

To circumvent such issues, you could directly use the token list that you obtained from the ASR model together with prepare_token_list.
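
A rough sketch of the token-based route, under stated assumptions: words_to_token_ids is a hypothetical helper, and vocab is a small subset of the vocab_dict from the question. It only builds the per-utterance id lists in plain Python. In a real script each list would presumably be converted to a NumPy array and handed to ctc_segmentation.prepare_token_list(config, tokens) in place of prepare_text; that call is left as a comment so the example runs without the library.

```python
# Subset of the vocabulary from the question (uppercase keys, integer ids).
vocab = {"<pad>": 0, "<unk>": 3, "|": 4, "A": 5, "I": 6, "E": 7, "O": 8, "-": 38}


def words_to_token_ids(words, vocab):
    """Map each word to a list of token ids, skipping characters the vocabulary lacks."""
    return [[vocab[c] for c in word.upper() if c in vocab] for word in words]


# Lowercased words as produced by transcription.lower().split() in the question.
words = ["ai", "oe", "-"]
token_ids = words_to_token_ids(words, vocab)
print(token_ids)

# Assumed follow-up with numpy and ctc_segmentation installed (not run here):
#   tokens = [numpy.array(ids) for ids in token_ids]
#   ground_truth_mat, utt_begin_indices = ctc_segmentation.prepare_token_list(config, tokens)
```

Because the utterances are expressed as token ids, no character-level exclusion applies, so the utterance consisting of a single - keeps its token.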

doublex commented, Jun 8, 2021

Your code works really well. I wanted to make a joke. Use the example as you like (no need to mention me).