[Bug]: Space token ' ' found in vocabulary even though it looks like BPE.
Describe the bug
Dear Maintainers,
I am trying to improve the ASR predictions of the Wav2Vec2 + CTC + RNN
recipe for CommonVoice by adding an ARPA language model at inference time with pyctcdecode,
based on this issue.
I trained my Wav2Vec2 + CTC + RNN model using the character-based and BPE-based tokenizations of SentencePiece and adapted it to work with the EncoderASR.from_hparams
method from SpeechBrain. Everything went well up to this point.
Then I built my ARPA model with SRILM (from the Kaldi tools), using the data that was used to build the tokenizer as input.
A sample of the data used for the ARPA model:
desogestrel 75 microgrammes à prendre 1 comprimé chaque jour vers midi toujours à la même heure pour 3 mois
lantus solostar 10 unités le soir au coucher pendant 3 mois
tranxene 5 milligrammes 1 gélule le midi 1 gélule au coucher pendant 6 mois
tolexine 50 milligrammes 2 comprimés le soir pendant 15 jours puis 1 le soir ensuite traitement pour 4 semaines à renouveler 1 fois
vitamine c effervescent 1 comprimé dans 1 verre d'eau 1 fois par jour quantité suffisante pour 3 mois
bandelette et lancette adaptées au lecteur de glycémie
1 comprimé le matin et 1 comprimé le soir pendant 8 semaines
A sample of the vocabulary obtained from this file by splitting it on whitespace and removing duplicate terms:
125
caféine
crème
revei/
prononcer
k/
applications
ronipyrole
prescrit
trimetazidine
pourcent
macrogol
sodique
spectral
72
kilo
zolpidem
m
ip
500
système
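For reference, the vocabulary step described above (split on whitespace, drop duplicates) could look roughly like the sketch below. This is not the script actually used for this issue: the function names are hypothetical, and keeping tokens in first-seen order is an assumption, since the sample above does not show any particular order.

```python
from typing import Iterable, List


def build_vocab(lines: Iterable[str]) -> List[str]:
    """Return the unique whitespace-separated tokens, in first-seen order."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for line in lines:
        for token in line.split():
            seen.setdefault(token, None)
    return list(seen)


def write_vocab(text_path: str, vocab_path: str) -> None:
    """Read a training text file and write one vocabulary entry per line."""
    with open(text_path, encoding="utf-8") as src:
        vocab = build_vocab(src)
    with open(vocab_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(vocab) + "\n")


# Example call (file names taken from the ngram-count command in this issue):
# write_vocab("train.txt", "vocab.txt")
```

UTF-8 is specified explicitly because the corpus contains accented French words such as "caféine" and "système".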
Command line used to build the 3-gram ARPA model:
order=3
file=train
/users/ylabrak/kaldi/tools/srilm/bin/i686-m64/ngram-count -unk -vocab vocab.txt -interpolate -kndiscount -gt1min 1 -gt2min 1 -gt3min 1 -gt4min 1 -gt5min 1 -text ${file}.txt -order ${order} -lm ${file}.arpa
Then I wrote my inference script:
import torch
import torchaudio
from pyctcdecode import build_ctcdecoder
from speechbrain.pretrained import EncoderASR

asr_model = EncoderASR.from_hparams(
    source="Run_8896_Chars_No_Underscore",
    savedir="pretrained_models/Run_8896_Chars_No_Underscore",
)

audio, sr = torchaudio.load("recording_1.wav")
rel_length = torch.tensor([1.0])
encoder_out = asr_model.encode_batch(audio, rel_length)

# Build the label list from the SentencePiece tokenizer, one entry per token id
labels = [
    asr_model.tokenizer.id_to_piece(piece_id).lower()
    for piece_id in range(asr_model.tokenizer.get_piece_size())
]
labels[1] = " "
labels[0] = "<pad>"

# !!! CRASH HERE !!!
decoder = build_ctcdecoder(
    labels,
    "/users/ylabrak/MedicalASR/Inference_ASR/Run_8896_Chars_No_Underscore/ARPA/train.arpa",
    alpha=0.6,
)

res = decoder.decode(encoder_out[0].cpu().numpy())
print(res)
And I get the following error:
(speechbrain_39) ylabrak@helios:~/XXXXX/Inference_ASR$ python predict_one_with_LM.py
Loading the LM will be faster if you build a binary file.
Reading /users/ylabrak/XXXXX/Inference_ASR/model.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Traceback (most recent call last):
  File "/users/ylabrak/XXXXX/Inference_ASR/predict_one_with_LM.py", line 34, in <module>
    decoder = build_ctcdecoder(labels, "model.arpa", alpha=0.6)
  File "/users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages/pyctcdecode/decoder.py", line 873, in build_ctcdecoder
    alphabet = Alphabet.build_alphabet(labels)
  File "/users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages/pyctcdecode/alphabet.py", line 143, in build_alphabet
    _verify_alphabet(labels, is_bpe)
  File "/users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages/pyctcdecode/alphabet.py", line 120, in _verify_alphabet
    raise ValueError("Space token ' ' found in vocabulary even though it looks like BPE.")
ValueError: Space token ' ' found in vocabulary even though it looks like BPE.
The ASR characters vocabulary using SentencePiece :
The ARPA model itself :
Expected behaviour
The ARPA model loads without crashing, and predictions are made with the LM after the Wav2Vec2 FR + CTC + RNN model.
To Reproduce
Run the inference script shown in the bug description above; the crash occurs at the build_ctcdecoder call.
Versions
Name: speechbrain
Version: 0.5.13
Summary: All-in-one speech toolkit in pure Python and Pytorch
Home-page: https://speechbrain.github.io/
Author: Mirco Ravanelli & Others
Author-email: speechbrain@gmail.com
License:
Location: /users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages
Requires: huggingface-hub, hyperpyyaml, joblib, numpy, packaging, scipy, sentencepiece, torch, torchaudio, tqdm
Required-by:
Name: torch
Version: 1.11.0+cu113
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages
Requires: typing-extensions
Required-by: flair, hugsvision, pytorch-lightning, speechbrain, timm, torchaudio, torchmetrics, torchtext, torchvision
Name: torchaudio
Version: 0.11.0
Summary: An audio package for PyTorch
Home-page: https://github.com/pytorch/audio
Author: Soumith Chintala, David Pollack, Sean Naren, Peter Goldsborough
Author-email: soumith@pytorch.org
License: UNKNOWN
Location: /users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages
Requires: torch
Required-by: speechbrain
Name: pyctcdecode
Version: 0.4.0
Summary: CTC beam search decoder for speech recognition.
Home-page: https://github.com/kensho-technologies/pyctcdecode
Author: Kensho Technologies, LLC.
Author-email: pyctcdecode-maintainer@kensho.com
License: Apache 2.0
Location: /users/ylabrak/.conda/envs/speechbrain_39/lib/python3.9/site-packages
Requires: hypothesis, numpy, pygtrie
Required-by:
Relevant log output
Same output and traceback as shown in the bug description above.
Additional context
It’s trained on French.
The inference is made on CPU.
If you need the model for prediction, don’t hesitate, I can send you everything by email.
Top GitHub Comments
@Antoine-Caubriere @Moumeneb1 please help 😄
Hi!
The issue is with pyctcdecode: it checks whether any of the labels starts with “▁” and, if so, assumes the vocabulary is BPE, which is not the case for your character-level tokenizer. The third label in your list is “▁”, and that is the one that makes pyctcdecode think it is BPE.
I don’t really know what “▁” stands for in your tokenizer. Maybe it is the space, in which case you can just replace it with a space " ". Otherwise, a quick fix (not super clean) is to replace it with a token such as “?” or anything else, just to run your experiment, and then remove it properly from your training vocabulary.
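A minimal sketch of the quick fix described above, for anyone hitting the same error. The helper names and the toy label list are hypothetical, and the check only approximates the heuristic as described in this comment; it is not pyctcdecode's own code. A placeholder is used rather than a space because the script in the issue already sets labels[1] to " ", so mapping “▁” to a space as well would presumably leave two identical labels.

```python
from typing import List, Tuple

SPM_MARKER = "\u2581"  # "▁", the SentencePiece word-boundary symbol


def find_bpe_like_labels(labels: List[str]) -> List[Tuple[int, str]]:
    """Return (index, label) pairs for labels that start with the "▁" marker."""
    return [(i, lab) for i, lab in enumerate(labels) if lab.startswith(SPM_MARKER)]


def replace_marker(labels: List[str], placeholder: str = "?") -> List[str]:
    """Swap the bare "▁" label for a placeholder that is not already a label."""
    if placeholder in labels:
        raise ValueError(f"placeholder {placeholder!r} is already a label")
    return [placeholder if lab == SPM_MARKER else lab for lab in labels]


# Toy label list shaped like the one described in this thread (hypothetical values).
labels = ["<pad>", " ", SPM_MARKER, "a", "b"]
print(find_bpe_like_labels(labels))  # [(2, '▁')]

fixed = replace_marker(labels)
print(find_bpe_like_labels(fixed))   # []
```

The cleaned list would then be passed to build_ctcdecoder in place of the original labels. The placeholder may show up in decoded text wherever the model emits that token, so this is only meant for running the experiment; removing the token from the training vocabulary remains the proper fix.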