Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier

I’m trying to get the sentiment of a set of comments using a pretrained Hugging Face sentiment-analysis model. It fails with the error: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).

My code is below; please take a look.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd

model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')

classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')

data.head()

Output is :

    Review
0   If you've ever been to Disneyland anywhere you...
1   Its been a while since d last time we visit HK...
2   Thanks God it wasn t too hot or too humid wh...
3   HK Disneyland is a great compact park. Unfortu...
4   the location is not in the city, took around 1...

Followed by

classifier("My name is mark")

Output is

[{'label': 'POSITIVE', 'score': 0.9953688383102417}]

Followed by code

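# Assumption: `value` holds the pipeline output shown above; it was not defined in the original snippet.
value = classifier("My name is mark")
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment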

Output is

['POSITIVE']

Appending all the rows to an empty list

text = []

for index, row in data.iterrows():
    text.append(row['Review'])

I’m trying to get the sentiment for all the rows

sent = []

for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)

The error is:

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
      2 
      3 for i in range(len(data)):
----> 4     sentiment = classifier(data.iloc[i,0])
      5     sent.append(sentiment)

11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1914         # remove once script supports set_grad_enabled
   1915         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1917 
   1918 

IndexError: index out of range in self
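The warning above says that one tokenized review has 651 tokens while the model accepts at most 512. As a minimal sketch for locating the offending rows, the helper below counts tokens per text. The name `find_long_rows` is hypothetical, and it assumes `tokenizer` behaves like the `AutoTokenizer` object loaded above (named `token` in the question): calling it on a string returns a mapping with an `"input_ids"` entry.

```python
def find_long_rows(tokenizer, texts, max_tokens=512):
    """Return (position, token_count) for every text longer than max_tokens.

    Assumption: `tokenizer` is callable on a string and returns a mapping
    that contains "input_ids", as Hugging Face tokenizers do.
    """
    too_long = []
    for position, text in enumerate(texts):
        # Count the token ids produced for this text (special tokens included).
        n_tokens = len(tokenizer(text)["input_ids"])
        if n_tokens > max_tokens:
            too_long.append((position, n_tokens))
    return too_long
```

With the objects from the question this might be called as `find_long_rows(token, data['Review'])`. The tokenizer may still print the length warning while counting, but no model forward pass happens, so the IndexError is not triggered.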

Issue Analytics

Created 2 years ago
Comments: 11 (1 by maintainers)

Top GitHub Comments

2 reactions

LysandreJik commented, Apr 5, 2021

Could you specify truncation=True when calling the pipeline with your data?

That is, replace classifier("My name is mark") with classifier("My name is mark", truncation=True).
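Applied to the loop from the question, the suggestion might look like the sketch below. The helper name `classify_with_truncation` is hypothetical; it assumes `classifier` is the sentiment-analysis pipeline built in the question and that each call on a single string returns a list with one `{'label': ..., 'score': ...}` dict, as in the output shown earlier.

```python
def classify_with_truncation(classifier, texts):
    """Return one sentiment label per text, truncating over-long inputs.

    Assumption: `classifier` is a Hugging Face sentiment-analysis pipeline
    (or any callable with the same call shape).
    """
    labels = []
    for text in texts:
        # truncation=True asks the tokenizer to cut the input down to the
        # model's maximum length instead of passing all tokens through.
        result = classifier(text, truncation=True)
        labels.append(result[0]["label"])
    return labels
```

For the DataFrame in the question, a call such as `classify_with_truncation(classifier, data['Review'].tolist())` would replace the failing loop. Note that truncation discards everything past the limit, so the sentiment reflects only the beginning of a long review.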

0 reactions

Abe410 commented, Jun 16, 2022

Thanks so much for your help! I also dug in a bit further… It seems the RoBERTa model I was using can only handle about 286 words? (I took an example text and cut it down until it ran.) It might be easiest to pre-process the data first rather than using truncation within the classifier.
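One way to pre-process instead of truncating is to split each review's token ids into windows that fit the model, so no text is discarded. This is only a sketch: `chunk_token_ids` is a hypothetical helper, and the default of 510 assumes a 512-token limit with two positions reserved for the special tokens that BERT-style models add.

```python
def chunk_token_ids(token_ids, max_tokens=510):
    """Split a list of token ids into consecutive chunks of at most max_tokens.

    Assumption: 510 = 512 minus two special tokens; adjust for other models.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Step through the ids in strides of max_tokens; the last chunk may be shorter.
    return [
        token_ids[start:start + max_tokens]
        for start in range(0, len(token_ids), max_tokens)
    ]
```

The ids could come from something like `tokenizer.encode(text, add_special_tokens=False)`, and each chunk could be turned back into text with `tokenizer.decode(chunk)` before classification. How to combine the per-chunk sentiments (majority vote, averaged scores) is a separate design choice that the thread does not settle.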

Actually, you can fine-tune a custom model on top of a pre-trained model if you have the content and its respective classes. That makes the model much more accurate. I have some BERT code; if you want, I can give it to you.

Hey

I am working on exactly the same problem as well. Does it really make the model more accurate?

Mind sharing the code with me as well? Thanks
