Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might already be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Issue: Adding new tokens to BERT tokenizer in QA

See original GitHub issue

WARNING: This issue is a replica of this other issue that I opened; I apologize if I have opened it in the wrong place.

Hello Hugging Face team (@sgugger, @joeddav, @LysandreJik), I have a problem with this code base: notebooks/examples/question_answering.ipynb - link.

ENV: Google Colab; transformers version 4.5.0; datasets version 1.5.0; torch version 1.8.1+cu101.

I am trying to add some domain tokens to the bert-base-cased tokenizer:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer

model_checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
list_of_domain_tokens = ["token1", "token2", "token3"]
tokenizer.add_tokens(list_of_domain_tokens)
...
...
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
print(model.device)  # cpu
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to the new vocabulary size
trainer = Trainer(...)

Then, during the trainer.train() call, it reports the attached error. Can you please tell me where I’m wrong? The tokenizer output is the usual BERT input expressed as List[List[int]], e.g. input_ids and attention_mask, so I can’t figure out where the device problem comes from: Input, output and indices must be on the current device.
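For what it’s worth, a common way to sidestep this particular error is to place the resized model on the GPU yourself before building the Trainer, so that the embedding matrix and the tokenized batches end up on the same device. The sketch below only reworks the snippet above under that assumption; it is not the fix that was later merged into transformers.

import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

model_checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.add_tokens(["token1", "token2", "token3"])

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
model.resize_token_embeddings(len(tokenizer))  # embeddings now match the enlarged vocabulary

# Explicitly move the model before the Trainer is created; on a Colab GPU runtime
# this prints cuda:0 instead of cpu, and the device-mismatch error goes away.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(model.device)

With the model already on the GPU, the Trainer can then be constructed and trained exactly as in the original snippet.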

Kind Regards, Andrea

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 14 (14 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, Apr 19, 2021

We will probably have to rethink the design then, since it’s not a simple “put on device if it wasn’t already” - there are multiple cases where it shouldn’t happen. For now, I added a hardcoded workaround: https://github.com/huggingface/transformers/pull/11322

1 reaction
sgugger commented, Apr 19, 2021

I would rather avoid adding this, as users are used to not having to set that argument to True when not using the example scripts. Can we just add the proper line in train to put the model on the device if that was not done already?

(Sorry, I didn’t catch that you were using do_train in the PR where you added that test; I should have caught it and commented there.)
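For context, the “proper line in train” amounts to a device-placement check right before the training loop starts. The helper below is only an illustration of that idea, written as a standalone function; it is not the code that ended up in the transformers Trainer or in the PR linked above.

import torch
from torch import nn

def ensure_model_on_device(model: nn.Module, device: torch.device) -> nn.Module:
    # Move the model only if its parameters are not already on the target device type,
    # mirroring the "put on device if it wasn't already" behaviour discussed above.
    first_param = next(model.parameters(), None)
    if first_param is not None and first_param.device.type != device.type:
        model.to(device)
    return model

Whether such a check belongs inside train() or should stay an explicit user-side call is exactly the design question the two comments above are debating.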

Read more comments on GitHub >

Top Results From Across the Web

  • Adding new tokens while preserving tokenization of adjacent ...
    I'm trying to add some new tokens to BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word...
  • Adding new tokens to various models changes tokenization of ...
    Steps to reproduce the behavior: (Distil)BERT: from transformers import DistilBertTokenizer, BertTokenizer new_word = 'mynewword' # BERT bt = ...
  • Adding new tokens to BERT/RoBERTa while retaining ...
    However, I'm running into some issues doing this. In particular, the tokens surrounding the newly added tokens do not behave as expected when ...
  • Adding a new token to a transformer model without breaking ...
    If you manually edit those files to add the new tokens in the right way, everything seems to work as expected. Here's an ...
  • How to add new tokens to huggingface transformers vocabulary
    In this short article, you'll learn how to add new tokens to the vocabulary of a huggingface transformer model.
