[Bug]: Space token ' ' found in vocabulary even though it looks like BPE.

Describe the bug

Dear Maintainers,

I am trying to improve the ASR predictions of the Wav2Vec2 + CTC + RNN recipe for CommonVoice by adding an ARPA language model at inference time with pyctcdecode, based on this issue.

I trained my Wav2Vec2 + CTC + RNN model using both the character-based and BPE-based tokenization of SentencePiece, and adapted it to work with the EncoderASR.from_hparams method from SpeechBrain. Everything went well up to this point.

Then I built my ARPA model with SRILM (from the Kaldi tools), using the same data that was used to build the tokenizer as input.

A sample of the data used for the ARPA model:

desogestrel 75 microgrammes à prendre 1 comprimé chaque jour vers midi toujours à la même heure pour 3 mois
lantus solostar 10 unités le soir au coucher pendant 3 mois
tranxene 5 milligrammes 1 gélule le midi 1 gélule au coucher pendant 6 mois
tolexine 50 milligrammes 2 comprimés le soir pendant 15 jours puis 1 le soir ensuite traitement pour 4 semaines à renouveler 1 fois
vitamine c effervescent 1 comprimé dans 1 verre d'eau 1 fois par jour quantité suffisante pour 3 mois
bandelette et lancette adaptées au lecteur de glycémie
1 comprimé le matin et 1 comprimé le soir pendant 8 semaines

A sample of the vocabulary obtained from this file by splitting it on whitespace and removing duplicate terms:

125
caféine
crème
revei/
prononcer
k/
applications
ronipyrole
prescrit
trimetazidine
pourcent
macrogol
sodique
spectral
72
kilo
zolpidem
m
ip
500
système
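
The vocabulary extraction described above (split on whitespace, drop duplicates) can be sketched in a few lines of Python. This is only an illustration: the helper name `build_vocab` is hypothetical and is not the script that was actually used to produce vocab.txt.

```python
def build_vocab(lines):
    """Collect unique whitespace-separated tokens, keeping first-seen order."""
    seen = {}
    for line in lines:
        for token in line.split():
            # dict keys preserve insertion order, so duplicates are dropped
            # without reordering the vocabulary.
            seen.setdefault(token, None)
    return list(seen)


# Two lines taken from the training sample shown earlier.
sample = [
    "lantus solostar 10 unités le soir au coucher pendant 3 mois",
    "1 comprimé le matin et 1 comprimé le soir pendant 8 semaines",
]
vocab = build_vocab(sample)
```

Writing one token per line from `vocab` gives a file in the shape of the list above.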

Command line used to build the 3-gram ARPA model:

order=3
# Base name only: the command below expands it to train.txt and train.arpa.
file=train
/users/ylabrak/kaldi/tools/srilm/bin/i686-m64/ngram-count -unk -vocab vocab.txt -interpolate -kndiscount -gt1min 1 -gt2min 1 -gt3min 1 -gt4min 1 -gt5min 1 -text ${file}.txt -order ${order} -lm ${file}.arpa

Then I wrote my inference script:

import torch
import torchaudio
from pyctcdecode import build_ctcdecoder
from speechbrain.pretrained import EncoderASR

# Load the trained character-level Wav2Vec2 + CTC + RNN model.
asr_model = EncoderASR.from_hparams(
    source="Run_8896_Chars_No_Underscore",
    savedir="pretrained_models/Run_8896_Chars_No_Underscore",
)

audio, sr = torchaudio.load("recording_1.wav")
rel_length = torch.tensor([1.0])

# Encoder outputs for the single utterance.
encoder_out = asr_model.encode_batch(audio, rel_length)

# One label per SentencePiece id, in id order.
tokenizer = asr_model.tokenizer
labels = [tokenizer.id_to_piece(i).lower() for i in range(tokenizer.get_piece_size())]

# Manually override the first two entries.
labels[1] = " "
labels[0] = "<pad>"

# !!! CRASH HERE !!!
decoder = build_ctcdecoder(
    labels,
    "/users/ylabrak/MedicalASR/Inference_ASR/Run_8896_Chars_No_Underscore/ARPA/train.arpa",
    alpha=0.6,
)

res = decoder.decode(encoder_out[0].cpu().numpy())
print(res)

Running it gives the following error:

(speechbrain_39) ylabrak@helios:~/XXXXX/Inference_ASR$ python predict_one_with_LM.py
Loading the LM will be faster if you build a binary file.
Reading /users/ylabrak/XXXXX/Inference_ASR/model.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Traceback (most recent call last):
  File "/users/ylabrak/XXXXX/Inference_ASR/predict_one_with_LM.py", line 34, in <module>
    decoder = build_ctcdecoder(labels, "model.arpa", alpha=0.6)
  File "/users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages/pyctcdecode/decoder.py", line 873, in build_ctcdecoder
    alphabet = Alphabet.build_alphabet(labels)
  File "/users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages/pyctcdecode/alphabet.py", line 143, in build_alphabet
    _verify_alphabet(labels, is_bpe)
  File "/users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages/pyctcdecode/alphabet.py", line 120, in _verify_alphabet
    raise ValueError("Space token ' ' found in vocabulary even though it looks like BPE.")
ValueError: Space token ' ' found in vocabulary even though it looks like BPE.

The character-level ASR vocabulary built with SentencePiece:

100_char.vocab.txt

The ARPA model itself :

train.arpa.txt

Expected behaviour

The ARPA model loads without crashing, and predictions are made with the LM on top of the Wav2Vec2 FR + CTC + RNN model.

Versions

Name: speechbrain
Version: 0.5.13
Summary: All-in-one speech toolkit in pure Python and Pytorch
Home-page: https://speechbrain.github.io/
Author: Mirco Ravanelli & Others
Author-email: speechbrain@gmail.com
License:
Location: /users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages
Requires: huggingface-hub, hyperpyyaml, joblib, numpy, packaging, scipy, sentencepiece, torch, torchaudio, tqdm
Required-by:

Name: torch
Version: 1.11.0+cu113
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages
Requires: typing-extensions
Required-by: flair, hugsvision, pytorch-lightning, speechbrain, timm, torchaudio, torchmetrics, torchtext, torchvision

Name: torchaudio
Version: 0.11.0
Summary: An audio package for PyTorch
Home-page: https://github.com/pytorch/audio
Author: Soumith Chintala, David Pollack, Sean Naren, Peter Goldsborough
Author-email: soumith@pytorch.org
License: UNKNOWN
Location: /users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages
Requires: torch
Required-by: speechbrain

Name: pyctcdecode
Version: 0.4.0
Summary: CTC beam search decoder for speech recognition.
Home-page: https://github.com/kensho-technologies/pyctcdecode
Author: Kensho Technologies, LLC.
Author-email: pyctcdecode-maintainer@kensho.com
License: Apache 2.0
Location: /users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages
Requires: hypothesis, numpy, pygtrie
Required-by:

Additional context

It’s trained on French.

Inference is run on CPU.

If you need the model to reproduce the prediction, I can send you everything by email.

Top GitHub Comments

TParcollet commented, Oct 24, 2022

@Antoine-Caubriere @Moumeneb1 please help 😄

Moumeneb1 commented, Nov 2, 2022

Hi!

The issue comes from pyctcdecode: it checks whether any label starts with “▁” and, if so, assumes the vocabulary is BPE, which is not the case for your character-level tokenizer. The third label in your list is “▁”, and that is what makes pyctcdecode think the vocabulary is BPE.

I don’t know what “▁” actually stands for in your tokenizer. If it is the space, you can simply replace it with a space " ". Otherwise, a quick (not very clean) fix is to replace it with a token such as “?”, just to run your experiment, and then remove it properly from your training vocabulary.
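
A minimal sketch of that workaround, applied to the labels list before it is passed to build_ctcdecoder. The helper names `looks_like_bpe` and `sanitize_char_labels` are hypothetical, and `looks_like_bpe` only mirrors the check as described in this comment; it is not pyctcdecode's own code. The example label list is made up for illustration.

```python
WORD_BOUNDARY = "\u2581"  # the SentencePiece "▁" marker


def looks_like_bpe(labels):
    """Stand-in for the heuristic described above (an assumption):
    a vocabulary is treated as BPE if any label starts with "▁"."""
    return any(label.startswith(WORD_BOUNDARY) for label in labels)


def sanitize_char_labels(labels, placeholder="?"):
    """Return a copy of a character-level label list without a bare "▁".

    The bare "▁" becomes " " if the list has no space yet; otherwise it
    becomes `placeholder`, so the list never holds two space labels.
    The placeholder is assumed not to be in the vocabulary already.
    """
    has_space = " " in labels
    cleaned = []
    for label in labels:
        if label == WORD_BOUNDARY:
            if has_space:
                cleaned.append(placeholder)
            else:
                cleaned.append(" ")
                has_space = True
        else:
            cleaned.append(label)
    return cleaned


# Hypothetical label list shaped like the one in the report:
# a space was written manually and "▁" is the third label.
labels = ["<pad>", " ", WORD_BOUNDARY, "e", "a"]
fixed = sanitize_char_labels(labels)
```

With this list, `fixed` keeps the same length and order as `labels`, so the label indices still line up with the model's output dimension. The placeholder route is only a stopgap for running the experiment; the cleaner fix remains removing the token from the training vocabulary.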
