
Fine-tune Hubert model: Adding new vocabulary

See original GitHub issue

Environment info

transformers version: 4.12.2
Platform: Mac
Python version: 3.7
PyTorch version (GPU?): 1.9
Tensorflow version (GPU?): No
Using GPU in script?: No
Using distributed or parallel setup in script?: No

I just run this simple code to load the pretrained Hubert model:

from transformers import Wav2Vec2Processor, HubertForCTC
import torch
import librosa

# load the processor (feature extractor + CTC tokenizer) and the fine-tuned model
PROCESSOR = Wav2Vec2Processor.from_pretrained('facebook/hubert-large-ls960-ft')
model = HubertForCTC.from_pretrained('facebook/hubert-large-ls960-ft')
tokenizer = PROCESSOR.tokenizer

On a smaller dataset, I am able to get a good WER of around 0.0.
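
For reference, WER on a single held-out clip can be checked with a short evaluation loop. The sketch below continues from the snippet above (reusing PROCESSOR and model) and assumes a 16 kHz mono file named sample.wav, a placeholder reference transcript, and the jiwer package for computing WER:

import torch
import librosa
from jiwer import wer  # pip install jiwer

# load a 16 kHz mono clip (placeholder file name)
speech, _ = librosa.load('sample.wav', sr=16000)

# feature-extract, run the CTC model, and greedy-decode the logits
inputs = PROCESSOR(speech, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    logits = model(inputs.input_values).logits
prediction = PROCESSOR.batch_decode(torch.argmax(logits, dim=-1))[0]

# compare against the ground-truth transcript (placeholder string)
print(wer('THE REFERENCE TRANSCRIPT', prediction))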

But if I add new tokens/vocabulary to it using the code below:

from transformers import Wav2Vec2Processor, HubertForCTC
import torch
import librosa
PROCESSOR = Wav2Vec2Processor.from_pretrained('facebook/hubert-large-ls960-ft')
model = HubertForCTC.from_pretrained('facebook/hubert-large-ls960-ft')
tokenizer = PROCESSOR.tokenizer
# add a space and the German umlaut characters to the tokenizer's vocabulary
tokenizer.add_tokens(new_tokens=[' ','Ä','Ö','Ü'])

The loss and WER get worse and worse (clearly), and eventually the loss becomes NaN.

Is this the correct way to add new characters?

The dataset is the same in both trainings.
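
The usual culprit here (a hypothetical diagnosis, not spelled out in the thread): add_tokens() enlarges the tokenizer, but the model's CTC output head (lm_head) keeps its original size, so the label ids produced for the new characters fall outside the head's output range and the CTC loss eventually turns NaN. Continuing from the snippet above, the mismatch is easy to see:

# the tokenizer has grown, but the CTC head still has the original vocab size
print(len(tokenizer))              # original vocab size + number of added tokens
print(model.lm_head.out_features)  # still the original vocab size
# any label id >= model.lm_head.out_features is an invalid CTC target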

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

2 reactions
harrypotter90 commented, Nov 18, 2021

Cool, it worked. Thank you

2 reactions
patrickvonplaten commented, Nov 18, 2021

Hey @harrypotter90,

Exactly, sorry, I forgot to mention this parameter. To summarize, I would recommend adding the new tokens and loading your model as follows:

from transformers import Wav2Vec2Processor, AutoModelForCTC

# load tokenizer & feature extractor
processor = Wav2Vec2Processor.from_pretrained('facebook/hubert-large-ls960-ft')

# add new tokens
tokenizer = processor.tokenizer
tokenizer.add_tokens(new_tokens=['Ä','Ö','Ü'])

# load pretrained model and replace fine-tuned head with resized randomly initialized head
model = AutoModelForCTC.from_pretrained("facebook/hubert-large-ls960-ft", vocab_size=len(tokenizer), ignore_mismatched_sizes=True)

# now use model for training
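
As a quick sanity check (a sketch, not part of the original reply): with vocab_size=len(tokenizer) and ignore_mismatched_sizes=True, the re-initialized head matches the enlarged tokenizer, so transcripts containing the new characters encode to valid CTC target ids. The example string is a placeholder:

# the resized, randomly initialized head now matches the enlarged tokenizer
assert model.lm_head.out_features == len(tokenizer)

# transcripts with the new umlaut characters now map to in-range label ids
labels = tokenizer('SCHÖN GRÜN ÄRGER', return_tensors='pt').input_ids
print(labels.max().item() < model.lm_head.out_features)  # True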

Top Results From Across the Web

Hubert - Hugging Face
Hubert was proposed in HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung  ...
Read more >
Add training recipes for HuBERT model pre-training and ASR ...
To fine-tune the HuBERT model for customized down-stream task, people need to install and adopt their training pipeline to fairseq. It will be ...
Read more >
Fine Tune Pre-trained BERT model on new dataset(and vocab)
Yes, you can add the token to Bert's vocab. but it is not recommended way. because BERT uses a word-piece-based vocabulary, so it...
Read more >
Detect emotion in speech data: Fine-tuning HuBERT using ...
I have already covered how to create this script (in excruciating detail) in a ... Since we will be using the facebook/hubert-base-ls960 as...
Read more >
HuBERT: Self-Supervised Speech Representation Learning ...
Since we expect a pre-trained model to provide better representations than the raw acoustic feature such as MFCCs, we can create a new...
Read more >
