Issue: Adding new tokens to the BERT tokenizer for QA
WARNING: This issue is a replica of this other issue opened by me; I'm sorry if I have opened it in the wrong place.
Hello Hugging Face team (@sgugger, @joeddav, @LysandreJik),
I have a problem with this notebook: notebooks/examples/question_answering.ipynb (link).
Environment: Google Colab; transformers 4.5.0; datasets 1.5.0; torch 1.8.1+cu101.
I am trying to add some domain-specific tokens to the bert-base-cased tokenizer:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer

model_checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Domain-specific tokens to add to the vocabulary
list_of_domain_tokens = ["token1", "token2", "token3"]
tokenizer.add_tokens(list_of_domain_tokens)
...
...
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
print(model.device)  # cpu
# Resize the embedding matrix to account for the newly added tokens
model.resize_token_embeddings(len(tokenizer))
trainer = Trainer(...)
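A quick way to verify that the resize took effect, reusing the names from the snippet above (a minimal sketch, using nothing beyond the public tokenizer/model API):

print(len(tokenizer))                             # vocabulary size including the 3 added tokens
print(model.get_input_embeddings().weight.shape)  # first dimension should now equal len(tokenizer)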
Then, during the trainer.train() call, it reports the error below.
Can you please tell me where I'm going wrong?
The tokenizer output is the usual BERT input expressed as List[List[int]], i.e. input_ids and attention_mask, so I can't figure out where the device problem comes from:
Input, output and indices must be on the current device
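That error usually means the training batches and the model weights are not on the same device. A minimal check, assuming a CUDA Colab runtime and the tokenizer/model objects from the snippet above:

import torch

print(next(model.parameters()).device)   # where the weights live (cpu, as printed above)
batch = tokenizer("Who wrote this?", "Andrea wrote this.", return_tensors="pt")
print(batch["input_ids"].device)         # freshly tokenized inputs also start on cpu
print(torch.cuda.is_available())         # whether a GPU is visible to torch at all

During training the Trainer moves each batch to its configured device, so if the model itself was never moved off the CPU the two end up on different devices.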
Kind Regards, Andrea
Top GitHub Comments
We will probably have to rethink the design then, since it's not a simple "put on device if it wasn't already"; there are multiple cases when it shouldn't happen. For now, I've added a hardcoded workaround: https://github.com/huggingface/transformers/pull/11322
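For context, the kind of guard being discussed could look roughly like the sketch below, written here as a standalone helper rather than the actual change in the linked PR; the place_model_on_device and args.device names follow the public Trainer/TrainingArguments API of this transformers version, and the exact condition is an assumption:

def move_model_to_device_if_needed(trainer):
    # Sketch of "put the model on the device if it wasn't already"; it is skipped
    # in the cases (e.g. model parallelism) where the Trainer must not move the
    # model itself.
    if trainer.place_model_on_device and trainer.model.device != trainer.args.device:
        trainer.model = trainer.model.to(trainer.args.device)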
I would rather avoid adding this, as users are used to not having to set that argument to True when not using the example scripts. Can we just add the proper line in train to put the model on the device if it was not done already? (Sorry, I didn't catch that you were using do_train in the PR where you added that test; I should have caught it and commented there.)
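In the meantime, a user-side workaround is simply to place the model on the device before constructing the Trainer. A sketch only, reusing the model, tokenizer, and Trainer import from the snippet in the question, where training_args and the dataset variables stand in for whatever the notebook builds:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # make sure the resized model is on the same device as the batches

trainer = Trainer(
    model=model,
    args=training_args,            # placeholder TrainingArguments
    train_dataset=train_dataset,   # placeholder datasets from the notebook
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()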