Embedding index getting out of range while running CamemBERT model
🐛 Bug
Information
Model I am using (Bert, XLNet …): Camembert
Language I am using the model on (English, Chinese …): French
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Take a file with French text
- Load the pretrained CamemBERT model and tokenizer as in the documentation
- Run inference
Initialisation:
```python
bert = CamembertModel.from_pretrained("camembert-base")
bert_tok = CamembertTokenizer.from_pretrained("camembert-base")
```
Inference, as in https://huggingface.co/transformers/usage.html#question-answering:
```python
inputs = bert_tok.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
text_tokens = bert_tok.convert_ids_to_tokens(input_ids)
answer_start_scores, answer_end_scores = bert(**inputs)
```
It works when the context argument (the text_pair argument) is removed, but I need it to do question answering with other models, and it leads to the same error with pipelines.
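As a quick sanity check (not part of the original report), the sketch below compares the tokenizer output against the model's embedding sizes to see which table the offending index overflows. The `question`/`context` strings are only placeholders:

```python
from transformers import CamembertModel, CamembertTokenizer

bert = CamembertModel.from_pretrained("camembert-base")
bert_tok = CamembertTokenizer.from_pretrained("camembert-base")

# Placeholder question/context pair, just to exercise the text_pair code path.
question = "Qui est le président ?"
context = "Emmanuel Macron est le président de la République française."

inputs = bert_tok.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")

# Each of these must stay within the corresponding embedding table,
# otherwise nn.Embedding raises "index out of range in self".
print("max input id:", inputs["input_ids"].max().item(),
      "| vocab size:", bert.config.vocab_size)
print("sequence length:", inputs["input_ids"].shape[1],
      "| max positions:", bert.config.max_position_embeddings)
if "token_type_ids" in inputs:
    print("max token_type_id:", inputs["token_type_ids"].max().item(),
          "| type vocab size:", bert.config.type_vocab_size)
```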
Stack trace:
```
IndexError Traceback (most recent call last)
<ipython-input-9-73762e6cf69b> in <module>
2 for utterances in file.readlines():
3 input_tensor = bert_tok.batch_encode_plus([utterances], pad_to_max_length=True, return_tensors="pt")
----> 4 last_hidden, pool = bert(input_tensor["input_ids"], input_tensor["attention_mask"])
5
6
~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 functools.update_wrapper(wrapper, hook)
549 grad_fn.register_hook(wrapper)
--> 550 return result
551
552 def __setstate__(self, state):
~/.local/lib/python3.8/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask)
780 head_mask = [None] * self.config.num_hidden_layers
781
--> 782 embedding_output = self.embeddings(
783 input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
784 )
~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 functools.update_wrapper(wrapper, hook)
549 grad_fn.register_hook(wrapper)
--> 550 return result
551
552 def __setstate__(self, state):
~/.local/lib/python3.8/site-packages/transformers/modeling_roberta.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
62 position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)
63
---> 64 return super().forward(
65 input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds
66 )
~/.local/lib/python3.8/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
172 if inputs_embeds is None:
173 inputs_embeds = self.word_embeddings(input_ids)
--> 174 position_embeddings = self.position_embeddings(position_ids)
175 token_type_embeddings = self.token_type_embeddings(token_type_ids)
176
~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
548 functools.update_wrapper(wrapper, hook)
549 grad_fn.register_hook(wrapper)
--> 550 return result
551
552 def __setstate__(self, state):
~/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py in forward(self, input)
110
111 def forward(self, input):
--> 112 return F.embedding(
113 input, self.weight, self.padding_idx, self.max_norm,
114 self.norm_type, self.scale_grad_by_freq, self.sparse)
~/.local/lib/python3.8/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1722 if dim == 3:
1723 div = pad(div, (0, 0, size // 2, (size - 1) // 2))
-> 1724 div = avg_pool2d(div, (size, 1), stride=1).squeeze(1)
1725 else:
1726 sizes = input.size()
IndexError: index out of range in self
```
Expected behavior
Inference should run without any error.
Environment info
- transformers version: 2.8.0
- Platform: Linux-5.6.10-arch1-1-x86_64-with-glibc2.2.5
- Python version: 3.8.2
- PyTorch version (GPU?): 1.4.0 (True) (Same with 1.5)
- Tensorflow version (GPU?): 2.2.0-rc4 (False)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Top GitHub Comments
I actually figured out my error. I was adding special tokens to the tokenizer (like begin-of-sequence) but did not resize the model's token embeddings via:
```python
model.resize_token_embeddings(len(self.tokenizer))
```
Just in case someone else is not reading the documentation carefully enough 🙈 Considering that, the error message did actually make sense.

It's patched now, please install from source and there should be no error anymore!
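For reference, a minimal sketch of the pattern described in the first comment above (the added tokens here are purely illustrative, not from the original thread):

```python
from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

# Adding special tokens enlarges the tokenizer vocabulary...
tokenizer.add_special_tokens({"additional_special_tokens": ["<ctx>", "<usr>"]})

# ...so the embedding matrix must be resized to match; otherwise the new IDs
# fall outside the embedding table and trigger "index out of range in self".
model.resize_token_embeddings(len(tokenizer))
```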