
Fine-tune Hubert model: Adding new vocabulary

See original GitHub issue

Environment info

transformers version: 4.12.2
Platform: Mac
Python version: 3.7
PyTorch version (GPU?): 1.9
Tensorflow version (GPU?): No
Using GPU in script?: No
Using distributed or parallel setup in script?: No

I just run this simple code to load the pretrained Hubert model:

from transformers import Wav2Vec2Processor, HubertForCTC
import torch
import librosa

# load the processor (feature extractor + CTC tokenizer) and the fine-tuned model
PROCESSOR = Wav2Vec2Processor.from_pretrained('facebook/hubert-large-ls960-ft')
model = HubertForCTC.from_pretrained('facebook/hubert-large-ls960-ft')
tokenizer = PROCESSOR.tokenizer

On a smaller dataset, I am able to get a good WER of around 0.0.
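
For reference, WER on a single held-out clip can be checked with a short evaluation loop. The sketch below continues from the snippet above (reusing PROCESSOR and model) and assumes a 16 kHz mono file named sample.wav, a placeholder reference transcript, and the jiwer package for computing WER:

import torch
import librosa
from jiwer import wer  # pip install jiwer

# load a 16 kHz mono clip (placeholder file name)
speech, _ = librosa.load('sample.wav', sr=16000)

# feature-extract, run the CTC model, and greedy-decode the logits
inputs = PROCESSOR(speech, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    logits = model(inputs.input_values).logits
prediction = PROCESSOR.batch_decode(torch.argmax(logits, dim=-1))[0]

# compare against the ground-truth transcript (placeholder string)
print(wer('THE REFERENCE TRANSCRIPT', prediction))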

But if I add new tokens/vocabulary to it using the code below:

from transformers import Wav2Vec2Processor, HubertForCTC
import torch
import librosa
PROCESSOR = Wav2Vec2Processor.from_pretrained('facebook/hubert-large-ls960-ft')
model = HubertForCTC.from_pretrained('facebook/hubert-large-ls960-ft')
tokenizer = PROCESSOR.tokenizer
# add a space and the German umlaut characters to the tokenizer's vocabulary
tokenizer.add_tokens(new_tokens=[' ','Ä','Ö','Ü'])

The loss and WER get worse and worse (clearly), and eventually the loss becomes NaN.

Is this the correct way to add new characters?

The dataset is the same in both trainings.
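
The usual culprit here (a hypothetical diagnosis, not spelled out in the thread): add_tokens() enlarges the tokenizer, but the model's CTC output head (lm_head) keeps its original size, so the label ids produced for the new characters fall outside the head's output range and the CTC loss eventually turns NaN. Continuing from the snippet above, the mismatch is easy to see:

# the tokenizer has grown, but the CTC head still has the original vocab size
print(len(tokenizer))              # original vocab size + number of added tokens
print(model.lm_head.out_features)  # still the original vocab size
# any label id >= model.lm_head.out_features is an invalid CTC target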

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

2 reactions
harrypotter90 commented, Nov 18, 2021

Cool, it worked. Thank you

2 reactions
patrickvonplaten commented, Nov 18, 2021

Hey @harrypotter90,

Exactly, sorry, I forgot to mention this parameter. To summarize, I would recommend adding the new tokens and loading your model as follows:

from transformers import Wav2Vec2Processor, AutoModelForCTC

# load tokenizer & feature extractor
processor = Wav2Vec2Processor.from_pretrained('facebook/hubert-large-ls960-ft')

# add new tokens
tokenizer = processor.tokenizer
tokenizer.add_tokens(new_tokens=['Ä','Ö','Ü'])

# load pretrained model and replace fine-tuned head with resized randomly initialized head
model = AutoModelForCTC.from_pretrained("facebook/hubert-large-ls960-ft", vocab_size=len(tokenizer), ignore_mismatched_sizes=True)

# now use model for training
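
As a quick sanity check (a sketch, not part of the original reply): with vocab_size=len(tokenizer) and ignore_mismatched_sizes=True, the re-initialized head matches the enlarged tokenizer, so transcripts containing the new characters encode to valid CTC target ids. The example string is a placeholder:

# the resized, randomly initialized head now matches the enlarged tokenizer
assert model.lm_head.out_features == len(tokenizer)

# transcripts with the new umlaut characters now map to in-range label ids
labels = tokenizer('SCHÖN GRÜN ÄRGER', return_tensors='pt').input_ids
print(labels.max().item() < model.lm_head.out_features)  # True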

Top Results From Across the Web

Hubert - Hugging Face
Hubert was proposed in HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung  ...
Read more >
Add training recipes for HuBERT model pre-training and ASR ...
To fine-tune the HuBERT model for customized down-stream task, people need to install and adopt their training pipeline to fairseq. It will be ...
Read more >
Fine Tune Pre-trained BERT model on new dataset(and vocab)
Yes, you can add the token to Bert's vocab. but it is not recommended way. because BERT uses a word-piece-based vocabulary, so it...
Read more >
Detect emotion in speech data: Fine-tuning HuBERT using ...
I have already covered how to create this script (in excruciating detail) in a ... Since we will be using the facebook/hubert-base-ls960 as...
Read more >
HuBERT: Self-Supervised Speech Representation Learning ...
Since we expect a pre-trained model to provide better representations than the raw acoustic feature such as MFCCs, we can create a new...
Read more >
