pipeline("sentiment-analysis") - index out of range in self
Environment info
- transformers version: 4.2.2
- Platform: Manjaro Linux (Feb 2021)
- Python version: 3.8.5
- PyTorch version (GPU?): 1.7.1 (GPU)
- Tensorflow version (GPU?):
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Library:
- tokenizers: @n1t0, @LysandreJik
- pipelines: @LysandreJik
Information
Model I am using (Bert, XLNet …): distilbert-base-uncased-finetuned-sst-2-english
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: sentiment analysis
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
My dataset consists of blog articles and the comments on them. Sometimes there are non-English characters, code snippets, or other unusual sequences.
The error occurs when:
- Initializing the default pipeline("sentiment-analysis") with device 0 or -1
- Running inference with truncation=True on my dataset
- After some time the classifier returns the following error:
CPU: Index out of range in self
GPU: /opt/conda/conda-bld/pytorch_1607370172916/work/aten/src/ATen/native/cuda/Indexing.cu:658: indexSelectLargeIndex: block: [56,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
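Both tracebacks point at the same failure mode: an out-of-range index into an embedding table. DistilBERT's position embeddings cover only 512 positions, so a sequence that is not actually truncated produces position ids the lookup cannot serve. A minimal sketch of the same failure with a plain embedding layer (assuming only that PyTorch is installed; the 512/768 sizes mirror DistilBERT's configuration):

```python
import torch

# DistilBERT-style position-embedding table: 512 learned positions.
position_embeddings = torch.nn.Embedding(num_embeddings=512, embedding_dim=768)

# Position ids for a 513-token sequence -- one past the table's range.
position_ids = torch.arange(513)

try:
    position_embeddings(position_ids)
except IndexError as err:
    # On CPU this raises "index out of range in self", matching the error above;
    # on GPU the same lookup trips the indexSelectLargeIndex assertion.
    print(err)
```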
Expected behavior
I thought at first that my data was messing up the tokenization process or the model because sometimes there are strange sequences in the data e.g. code, links or stack traces.
However, if you specify the model and tokenizer explicitly during pipeline initialization, inference works fine on the same data:
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english', tokenizer='distilbert-base-uncased', device=0)
Issue Analytics
- Created 3 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Hello!
Thank you so much! That fixed the issue. I already suspected the missing max_length could be the problem, but passing max_length = 512 to the pipeline's call function did not help on its own. I had used the truncation flag before, but I guess it did not work due to the missing max_length value. Anyway, it works perfectly now! Thank you!
Unfortunately this was due to the ill-configured tokenizer on the hub. We’re working on a more general fix to prevent this from happening in the future.
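The mechanism behind the "ill-configured tokenizer" is worth spelling out: when a tokenizer config on the hub omits model_max_length, transformers falls back to a very large sentinel (VERY_LARGE_INTEGER, int(1e30)), so truncation=True truncates to that limit, i.e. not at all, and over-long sequences reach the model and crash the embedding lookup. A simplified pure-Python sketch of that logic (the truncate helper is hypothetical; only the sentinel value is taken from transformers):

```python
# Sketch of why truncation silently no-ops when the hub tokenizer
# config omits model_max_length.
VERY_LARGE_INTEGER = int(1e30)  # transformers' fallback sentinel

def truncate(input_ids, truncation=True, max_length=None,
             model_max_length=VERY_LARGE_INTEGER):
    """Truncate token ids the way a tokenizer would (simplified)."""
    if not truncation:
        return input_ids
    limit = max_length if max_length is not None else model_max_length
    return input_ids[:limit]

ids = list(range(600))  # pretend a 600-token document

# Ill-configured tokenizer: no model_max_length and no explicit max_length,
# so "truncation" keeps all 600 tokens and the 512-position model fails.
assert len(truncate(ids, truncation=True)) == 600

# An explicit max_length (or a properly configured model_max_length=512)
# actually truncates, keeping the position embeddings in range.
assert len(truncate(ids, truncation=True, max_length=512)) == 512
```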
Happy to help!