pipeline("sentiment-analysis") - index out of range in self
Environment info
- transformers version: 4.2.2
- Platform: Manjaro Linux (Feb 2021)
- Python version: 3.8.5
- PyTorch version (GPU?): 1.7.1 (GPU)
- Tensorflow version (GPU?):
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Library:
- tokenizers: @n1t0, @LysandreJik
- pipelines: @LysandreJik
Information
Model I am using (Bert, XLNet …): distilbert-base-uncased-finetuned-sst-2-english
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: sentiment analysis
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
My dataset consists of blog articles and the comments on them. Sometimes there are non-English characters, code snippets, or other unusual sequences.
The error occurs when:
- Initializing the default pipeline("sentiment-analysis") with device 0 or -1
- Running inference with truncation=True on my dataset
- After some time the classifier returns the following error:
CPU: Index out of range in self
GPU: /opt/conda/conda-bld/pytorch_1607370172916/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [56,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
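Both tracebacks point at the same failure mode: an out-of-range index into an embedding table. DistilBERT's position embeddings cover only 512 positions, so a sequence that is not actually truncated produces position ids the lookup cannot serve. A minimal sketch of the same failure with a plain embedding layer (assuming only that PyTorch is installed; the 512/768 sizes mirror DistilBERT's configuration):

```python
import torch

# DistilBERT-style position-embedding table: 512 learned positions.
position_embeddings = torch.nn.Embedding(num_embeddings=512, embedding_dim=768)

# Position ids for a 513-token sequence -- one past the table's range.
position_ids = torch.arange(513)

try:
    position_embeddings(position_ids)
except IndexError as err:
    # On CPU this raises "index out of range in self", matching the error above;
    # on GPU the same lookup trips the indexSelectLargeIndex assertion.
    print(err)
```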
Expected behavior
I thought at first that my data was messing up the tokenization process or the model because sometimes there are strange sequences in the data e.g. code, links or stack traces.
However, if you specify the model and tokenizer explicitly during pipeline initialization, inference works fine on the same data:
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english', tokenizer='distilbert-base-uncased', device=0)
Issue Analytics
- Created 3 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Hello!
Thank you so much! That fixed the issue. I already suspected the missing max_length could be the problem, but passing max_length = 512 to the pipeline's call function did not help on its own. I had used the truncation flag before, but I guess it did not work due to the missing max_length value. Anyway, it works perfectly now! Thank you!
Unfortunately this was due to the ill-configured tokenizer on the hub. We’re working on a more general fix to prevent this from happening in the future.
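The mechanism behind the "ill-configured tokenizer" is worth spelling out: when a tokenizer config on the hub omits model_max_length, transformers falls back to a very large sentinel (VERY_LARGE_INTEGER, int(1e30)), so truncation=True truncates to that limit, i.e. not at all, and over-long sequences reach the model and crash the embedding lookup. A simplified pure-Python sketch of that logic (the truncate helper is hypothetical; only the sentinel value is taken from transformers):

```python
# Sketch of why truncation silently no-ops when the hub tokenizer
# config omits model_max_length.
VERY_LARGE_INTEGER = int(1e30)  # transformers' fallback sentinel

def truncate(input_ids, truncation=True, max_length=None,
             model_max_length=VERY_LARGE_INTEGER):
    """Truncate token ids the way a tokenizer would (simplified)."""
    if not truncation:
        return input_ids
    limit = max_length if max_length is not None else model_max_length
    return input_ids[:limit]

ids = list(range(600))  # pretend a 600-token document

# Ill-configured tokenizer: no model_max_length and no explicit max_length,
# so "truncation" keeps all 600 tokens and the 512-position model fails.
assert len(truncate(ids, truncation=True)) == 600

# An explicit max_length (or a properly configured model_max_length=512)
# actually truncates, keeping the position embeddings in range.
assert len(truncate(ids, truncation=True, max_length=512)) == 512
```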
Happy to help!