
BartTokenizer add_tokens feature.

See original GitHub issue

Hi @LysandreJik,

I am working on audio captioning, and the ground-truth captions are tokenized using the BartTokenizer. I have observed that some of the words in the captions are not tokenized correctly. For instance, the word 'rumbling' is not in the tokenizer's vocabulary and is tokenized as ['Ġr', 'umbling']. I have tried to add the token (the word 'Ġrumbling') and resize the model's token embeddings, but instead of being tokenized correctly, the word is still split into ['Ġr', 'umbling']. Did I miss anything here? I have faced the same issue with some other words too!

Here is my code!

from transformers import AutoTokenizer, BartForConditionalGeneration

# Load the pretrained tokenizer and add the new token
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base", use_fast=True)
tokenizer.add_tokens(['Ġrumbling'])

# Load the pretrained model and resize its embeddings to match the new vocabulary size
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.resize_token_embeddings(len(tokenizer))
print(tokenizer.is_fast)

# Tokenize a caption containing the word 'rumbling'
ou_e = 'the rain falls down while someone is pounding a car passes by and the thunder is rumbling'
tok_e = tokenizer(ou_e, max_length=64, return_tensors='pt', padding='max_length')
seq = tokenizer.tokenize(ou_e)
print(seq)  # 'rumbling' still shows up as ['Ġr', 'umbling']

# Generate from the tokenized input and decode the result
summary_ids = model.generate(tok_e['input_ids'], num_beams=4, min_length=5, max_length=100)
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(summary)

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
ydshieh commented, Aug 22, 2022

@Charithavarma If you want to use the pretrained model facebook/bart-base, it is always best to use its corresponding tokenizer. If you change the tokenizer (for example, by adding a new token, which may also change how some sentences are tokenized), it is normal for model performance to be affected, since the model has never seen the word/token rumbling before.

If adding new tokens is really important for your task, you should probably consider fine-tuning the original model with the modified tokenizer.
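
Not from the thread, but a minimal sketch of what such fine-tuning could look like, assuming a plain PyTorch loop and a toy list of captions (the data, learning rate, and epoch count below are illustrative only):

import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

# Illustrative captions; a real setup would use the audio-captioning dataset.
captions = [
    "the thunder is rumbling in the distance",
    "a car passes by while the rain falls down",
]

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base", use_fast=True)
tokenizer.add_tokens(["rumbling"])              # the modified tokenizer
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.resize_token_embeddings(len(tokenizer))   # new embedding rows start randomly initialized

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):                          # illustrative number of epochs
    for text in captions:
        batch = tokenizer(text, return_tensors="pt")
        # Simple reconstruction objective: labels are the input ids themselves.
        # A real audio-captioning setup would pair encoder inputs with target captions.
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Whether this is worth the effort depends on how many new tokens the task actually needs; a single rare word that the BPE already covers as subwords may not justify it.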

1 reaction
SaulLu commented, Aug 18, 2022

After searching a little, I realize that the current documentation is not very explicit on this point. I propose to detail it a little in PR https://github.com/huggingface/transformers/pull/18687 ☺️
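
For context, this is my reading of how add_tokens behaves, not a quote from the PR: tokens passed to add_tokens are matched against the raw input text, so adding the plain word, without the byte-level 'Ġ' prefix that only exists inside the BPE vocabulary, is what makes it come out as a single token. A minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base", use_fast=True)
# Add the plain surface form; 'Ġrumbling' never appears literally in raw text,
# so a token added under that name is never matched.
tokenizer.add_tokens(["rumbling"])
print(tokenizer.tokenize("the thunder is rumbling"))
# expected: 'rumbling' now appears as a single token rather than ['Ġr', 'umbling']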

Read more comments on GitHub >

Top Results From Across the Web

  • BART - Hugging Face
    Resources. A list of official Hugging Face and community resources to help you get started with BART. If you're interested...
  • Special tokens to pre-trained BART model #3446 - GitHub
    I am only familiar with the add_special_tokens functionality for new tokens that get the "special tokens" treatment.
  • Adding new tokens to BERT/RoBERTa while retaining ...
    I'm trying to add some new tokens to BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word....
  • How to use BERT from the Hugging Face transformer library
    The add_special_tokens parameter is just for BERT to add tokens like ... I typically use the tokenizer.encode_plus() function to tokenize my ...
  • Adding a new token to a transformer model without breaking ...
    ... is a helper function to loop through a list of new tokens and get the byte-pair encodings # such that the new...
