BartTokenizer add_tokens feature
See original GitHub issue.

Hi @LysandreJik,
I am working on audio captioning, and the ground truth captions are tokenized using the BartTokenizer. I have observed that some of the words in the captions are not tokenized correctly. For instance, the word 'rumbling' is not in the tokenizer's vocabulary, and it is tokenized as ['Ġr', 'umbling']. I tried to add the token (the word 'Ġrumbling') and resize the model's token embeddings, but the word is still tokenized as ['Ġr', 'umbling']. Did I miss anything here? I have faced the same issue with some other words too!
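For context on where the 'Ġ' comes from (my own illustration, not from the thread): GPT-2/BART-style byte-level BPE tokenizers remap every raw byte to a printable character before merging, and the space byte (0x20) lands on 'Ġ' (U+0120). A minimal re-implementation of that mapping shows why ' rumbling' appears as 'Ġrumbling':

```python
# Sketch of the byte-to-unicode remapping used by GPT-2/BART-style
# byte-level BPE tokenizers (reimplemented here for illustration).
def bytes_to_unicode():
    # Printable bytes keep their own codepoint; everything else
    # (control chars, space, etc.) is shifted into the 256+ range.
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("\xa1"), ord("\xac") + 1))
            + list(range(ord("\xae"), ord("\xff") + 1)))
    mapping = {}
    shift = 0
    for b in range(256):
        if b in keep:
            mapping[b] = chr(b)
        else:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

table = bytes_to_unicode()
print(table[ord(" ")])  # -> 'Ġ' (U+0120): a leading space becomes this marker
print("".join(table[b] for b in " rumbling".encode("utf-8")))  # -> 'Ġrumbling'
```

So 'Ġrumbling' is not a literal word in the training text; it is the word 'rumbling' preceded by a space, after the byte remapping.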
Here is my code!
from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base", use_fast=True)
tokenizer.add_tokens(['Ġrumbling'])
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.resize_token_embeddings(len(tokenizer))
print(tokenizer.is_fast)
ou_e = 'the rain falls down while someone is pounding a car passes by and the thunder is rumbling'
tok_e = tokenizer(ou_e, max_length=64, return_tensors='pt', padding='max_length')
seq = tokenizer.tokenize(ou_e)
print(seq)
summary_ids = model.generate(tok_e['input_ids'], num_beams=4, min_length=5, max_length=100)
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(summary)
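One likely explanation (my reading, not confirmed in this thread): added tokens are matched against the raw input text, which contains a plain space rather than the 'Ġ' marker, so 'Ġrumbling' never matches anything. What has worked for me is adding the bare word wrapped in an AddedToken with lstrip=True, so the token can absorb the preceding space; a sketch (downloads facebook/bart-base, so it needs network access):

```python
from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base", use_fast=True)

# Add the bare word, not 'Ġrumbling'. lstrip=True lets the added token
# absorb the space in front of it, so " rumbling" in running text matches.
tokenizer.add_tokens([AddedToken("rumbling", lstrip=True)])

# The word now comes out as a single token instead of ['Ġr', 'umbling'].
print(tokenizer.tokenize("the thunder is rumbling"))
```

As noted in the answer below the model still needs fine-tuning to make use of the new token; this only fixes the tokenization side.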
Issue Analytics
- State:
- Created a year ago
- Comments: 8 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@Charithavarma If you want to use the trained model facebook/bart-base, it's always good to use the corresponding tokenizer. If you change the tokenizer (for example, here you add a new token, so the tokenization of some sentences may change too), it is normal that the model's performance is affected (it has never seen the word/token 'rumbling' before). If adding new tokens is really important for your task, you should consider fine-tuning the original model with the changed tokenizer.

By searching a little I realized that the current documentation is not very explicit on this point. I propose to detail it a little in the PR https://github.com/huggingface/transformers/pull/18687 ☺️
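When fine-tuning with a new token, one common trick (my suggestion, not from the thread) is to initialize the new embedding row from the sub-token pieces it replaces, rather than leaving the row at its random initialization after resize_token_embeddings. A torch-only sketch of the idea, with toy sizes standing in for BART's real (50265, 768) table:

```python
import torch

torch.manual_seed(0)
vocab_size, hidden = 10, 4                   # toy sizes for illustration
emb = torch.nn.Embedding(vocab_size, hidden)  # stands in for the model's input embeddings

# Pretend the new word used to split into sub-token ids 3 and 7
# (the way 'rumbling' splits into 'Ġr' + 'umbling').
sub_ids = torch.tensor([3, 7])

# Grow the table by one row, as model.resize_token_embeddings(len(tokenizer)) would.
new_emb = torch.nn.Embedding(vocab_size + 1, hidden)
with torch.no_grad():
    new_emb.weight[:vocab_size] = emb.weight
    # Initialize the new row as the mean of its old sub-token embeddings,
    # so fine-tuning starts from something meaningful instead of noise.
    new_emb.weight[vocab_size] = emb.weight[sub_ids].mean(dim=0)

print(new_emb.weight.shape)  # one extra row for the new token
```

In a real setup you would then assign the new table back with model.set_input_embeddings (or copy the row into the resized model's embeddings) before fine-tuning.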