
BartTokenizer add_tokens feature.

See original GitHub issue

Hi @LysandreJik,

I am working on audio captioning, and the ground-truth captions are tokenized using the BartTokenizer. I have observed that some of the words in the captions are not tokenized correctly. For instance, the word 'rumbling' is not in the tokenizer's vocabulary and is tokenized as ['Ġr', 'umbling']. I have tried to add the token (the word 'Ġrumbling') and resize the model's token embeddings, but instead of being tokenized correctly, the word is still split into ['Ġr', 'umbling']. Did I miss anything here? I have faced the same issue with some other words too!

Here is my code!

from transformers import AutoTokenizer, BartForConditionalGeneration

# Load the pretrained tokenizer and add the new token
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base", use_fast=True)
tokenizer.add_tokens(['Ġrumbling'])

# Load the pretrained model and resize its embeddings to match the new vocabulary size
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.resize_token_embeddings(len(tokenizer))
print(tokenizer.is_fast)

# Tokenize a caption containing the word 'rumbling'
ou_e = 'the rain falls down while someone is pounding a car passes by and the thunder is rumbling'
tok_e = tokenizer(ou_e, max_length=64, return_tensors='pt', padding='max_length')
seq = tokenizer.tokenize(ou_e)
print(seq)  # 'rumbling' still shows up as ['Ġr', 'umbling']

# Generate from the tokenized input and decode the result
summary_ids = model.generate(tok_e['input_ids'], num_beams=4, min_length=5, max_length=100)
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(summary)

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
ydshieh commented, Aug 22, 2022

@Charithavarma If you want to use the pretrained model facebook/bart-base, it is always best to use its corresponding tokenizer. If you change the tokenizer (for example, by adding a new token, which may also change how some sentences are tokenized), it is normal for model performance to be affected, since the model has never seen the word/token rumbling before.

If adding new tokens is really important for your task, you should probably consider fine-tuning the original model with the modified tokenizer.
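
Not from the thread, but a minimal sketch of what such fine-tuning could look like, assuming a plain PyTorch loop and a toy list of captions (the data, learning rate, and epoch count below are illustrative only):

import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

# Illustrative captions; a real setup would use the audio-captioning dataset.
captions = [
    "the thunder is rumbling in the distance",
    "a car passes by while the rain falls down",
]

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base", use_fast=True)
tokenizer.add_tokens(["rumbling"])              # the modified tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.resize_token_embeddings(len(tokenizer))   # new embedding rows start randomly initialized

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):                          # illustrative number of epochs
    for text in captions:
        batch = tokenizer(text, return_tensors="pt")
        # Simple reconstruction objective: labels are the input ids themselves.
        # A real audio-captioning setup would pair encoder inputs with target captions.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Whether this is worth the effort depends on how many new tokens the task actually needs; a single rare word that the BPE already covers as subwords may not justify it.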

1 reaction
SaulLu commented, Aug 18, 2022

After searching a little, I realize that the current documentation is not very explicit on this point. I propose to detail it a little in PR https://github.com/huggingface/transformers/pull/18687 ☺️
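
For context, this is my reading of how add_tokens behaves, not a quote from the PR: tokens passed to add_tokens are matched against the raw input text, so adding the plain word, without the byte-level 'Ġ' prefix that only exists inside the BPE vocabulary, is what makes it come out as a single token. A minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base", use_fast=True)
# Add the plain surface form; 'Ġrumbling' never appears literally in raw text,
# so a token added under that name is never matched.
tokenizer.add_tokens(["rumbling"])
print(tokenizer.tokenize("the thunder is rumbling"))
# expected: 'rumbling' now appears as a single token rather than ['Ġr', 'umbling']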

Read more comments on GitHub >

Top Results From Across the Web

  • BART - Hugging Face
    Resources. A list of official Hugging Face and community resources to help you get started with BART. If you're interested...
  • Special tokens to pre-trained BART model #3446 - GitHub
    I am only familiar with the add_special_tokens functionality for new tokens that get the "special tokens" treatment.
  • Adding new tokens to BERT/RoBERTa while retaining ...
    I'm trying to add some new tokens to BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word....
  • How to use BERT from the Hugging Face transformer library
    The add_special_tokens parameter is just for BERT to add tokens like ... I typically use the tokenizer.encode_plus() function to tokenize my ...
  • Adding a new token to a transformer model without breaking ...
    ... is a helper function to loop through a list of new tokens and get the byte-pair encodings # such that the new...
