
Embedding index getting out of range while running CamemBERT model

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): CamemBERT

Language I am using the model on (English, Chinese …): French

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Take a file with French text
  2. Load the pretrained CamemBERT model and tokenizer as in the docs
  3. Run inference:
# Encode the question/context pair and run a forward pass
inputs = bert_tok.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]
text_tokens = bert_tok.convert_ids_to_tokens(input_ids)
# This forward pass raises the IndexError shown in the stack trace below
answer_start_scores, answer_end_scores = bert(**inputs)

It works when the context argument (the text_pair argument) is removed, but I need it to do question answering with other models, and it leads to the same error with pipelines.
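
For reference, a minimal sketch of the two calls described above (bert and bert_tok follow the reproduction snippet; the question and context strings are illustrative placeholders, not taken from the issue):

# Single sequence: reported to work
inputs_single = bert_tok.encode_plus(question, add_special_tokens=True, return_tensors="pt")
outputs = bert(**inputs_single)

# Sequence pair (question + context): reported to raise
# "IndexError: index out of range in self"
inputs_pair = bert_tok.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")
outputs = bert(**inputs_pair)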

  • Stack trace (a diagnostic sketch follows the traceback):
IndexError                                Traceback (most recent call last)
<ipython-input-9-73762e6cf69b> in <module>
      2     for utterances in file.readlines():
      3         input_tensor = bert_tok.batch_encode_plus([utterances], pad_to_max_length=True, return_tensors="pt")
----> 4         last_hidden, pool = bert(input_tensor["input_ids"], input_tensor["attention_mask"])
      5 
      6 

~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548                     functools.update_wrapper(wrapper, hook)
    549                     grad_fn.register_hook(wrapper)
--> 550         return result
    551 
    552     def __setstate__(self, state):

~/.local/lib/python3.8/site-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask)
    780             head_mask = [None] * self.config.num_hidden_layers
    781 
--> 782         embedding_output = self.embeddings(
    783             input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
    784         )

~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548                     functools.update_wrapper(wrapper, hook)
    549                     grad_fn.register_hook(wrapper)
--> 550         return result
    551 
    552     def __setstate__(self, state):

~/.local/lib/python3.8/site-packages/transformers/modeling_roberta.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
     62                 position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)
     63 
---> 64         return super().forward(
     65             input_ids, token_type_ids=token_type_ids, position_ids=position_ids, inputs_embeds=inputs_embeds
     66         )

~/.local/lib/python3.8/site-packages/transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
    172         if inputs_embeds is None:
    173             inputs_embeds = self.word_embeddings(input_ids)
--> 174         position_embeddings = self.position_embeddings(position_ids)
    175         token_type_embeddings = self.token_type_embeddings(token_type_ids)
    176 

~/.local/lib/python3.8/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548                     functools.update_wrapper(wrapper, hook)
    549                     grad_fn.register_hook(wrapper)
--> 550         return result
    551 
    552     def __setstate__(self, state):

~/.local/lib/python3.8/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    110 
    111     def forward(self, input):
--> 112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
    114             self.norm_type, self.scale_grad_by_freq, self.sparse)

~/.local/lib/python3.8/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1722     if dim == 3:
   1723         div = pad(div, (0, 0, size // 2, (size - 1) // 2))
-> 1724         div = avg_pool2d(div, (size, 1), stride=1).squeeze(1)
   1725     else:
   1726         sizes = input.size()

IndexError: index out of range in self
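
The IndexError above comes from an nn.Embedding lookup receiving an index at least as large as the embedding table it indexes. A hedged diagnostic sketch (bert and bert_tok follow the reproduction snippet; the config attributes are the standard ones on transformers BERT/RoBERTa-style configs) that narrows down which table is being overrun:

# Compare each index tensor against the size of the embedding table it indexes.
inputs = bert_tok.encode_plus(question, context, add_special_tokens=True, return_tensors="pt")

print("max input id:", inputs["input_ids"].max().item(),
      "vs vocab_size:", bert.config.vocab_size)
# Note: RoBERTa-style models offset position ids past the padding index,
# so the usable sequence length is slightly smaller than max_position_embeddings.
print("sequence length:", inputs["input_ids"].shape[-1],
      "vs max_position_embeddings:", bert.config.max_position_embeddings)
if "token_type_ids" in inputs:
    print("max token type id:", inputs["token_type_ids"].max().item(),
          "vs type_vocab_size:", bert.config.type_vocab_size)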

Expected behavior

Inference should run without any error.

Environment info

  • transformers version: 2.8.0

  • Platform: Linux-5.6.10-arch1-1-x86_64-with-glibc2.2.5
  • Python version: 3.8.2
  • PyTorch version (GPU?): 1.4.0 (True) (Same with 1.5)
  • Tensorflow version (GPU?): 2.2.0-rc4 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (8 by maintainers)

Top GitHub Comments

21 reactions
jwallat commented, May 11, 2020

I actually figured out my error. I was adding special tokens to the tokenizer (like begin-of-sequence) but did not resize the model's token embeddings via model.resize_token_embeddings(len(self.tokenizer)). Just in case someone else is not reading the documentation carefully enough 🙈 Considering that, the error message actually made sense.
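
For reference, a minimal sketch of the pattern this comment describes (the checkpoint name and the added token are illustrative, not taken from the comment):

from transformers import CamembertModel, CamembertTokenizer

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

# Adding special tokens grows the tokenizer's vocabulary...
tokenizer.add_special_tokens({"additional_special_tokens": ["<utt>"]})

# ...so the model's token embedding matrix must be resized to match,
# otherwise the new token ids index past the end of the embedding table.
model.resize_token_embeddings(len(tokenizer))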

1 reaction
LysandreJik commented, May 11, 2020

It’s patched now, please install from source and there should be no error anymore!

Read more comments on GitHub >

Top Results From Across the Web

While training BERT variant, getting IndexError: index out of ...
Mismatching vocabulary size of tokenizer and bert model. This will cause the tokenizer to generate IDs that the model...
Read more >
Embeddings index out of range error - PyTorch Forums
You have to check the range of the input tensor to the nn.Embedding layer and make sure its values are in [0,...
Read more >
CamemBERT - Hugging Face
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A CamemBERT...
Read more >
sentence-transformers 0.3.0 - PyPI
pip install -e . Getting Started. Sentences Embedding with a Pretrained Model. This example shows you how to use an already trained Sentence...
Read more >
Basics of BERT and XLM-RoBERTa - PyTorch - Kaggle
For example, a Bert model trained on a GPU is 600MB. However, a BERT model trained on a TPU is approx. 1GB. Therefore,...
Read more >
