
Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512) with Hugging face sentiment classifier

See original GitHub issue

I’m trying to get the sentiment of comments using a pretrained Hugging Face sentiment-analysis model. It returns the error: Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512).

Below is the code; please take a look.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import transformers
import pandas as pd

model = AutoModelForSequenceClassification.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')
token = AutoTokenizer.from_pretrained('/content/drive/MyDrive/Huggingface-Sentiment-Pipeline')

classifier = pipeline(task='sentiment-analysis', model=model, tokenizer=token)

data = pd.read_csv('/content/drive/MyDrive/DisneylandReviews.csv', encoding='latin-1')

data.head()

The output is:

    Review
0   If you've ever been to Disneyland anywhere you...
1   Its been a while since d last time we visit HK...
2   Thanks God it wasn t too hot or too humid wh...
3   HK Disneyland is a great compact park. Unfortu...
4   the location is not in the city, took around 1...

Followed by

classifier("My name is mark")

Output is

[{'label': 'POSITIVE', 'score': 0.9953688383102417}]

Followed by code

value = classifier("My name is mark")
basic_sentiment = [i['label'] for i in value if 'label' in i]
basic_sentiment

Output is

['POSITIVE']

Appending all the rows to an empty list:

text = []

for index, row in data.iterrows():
    text.append(row['Review'])
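As an aside, the row-by-row append above can be collapsed into a single pandas call. A minimal sketch with stand-in data (the real code would use the DisneylandReviews DataFrame loaded earlier):

```python
import pandas as pd

# Stand-in for the DisneylandReviews data from the question.
data = pd.DataFrame({"Review": ["great park", "too crowded"]})

# Equivalent to iterating with iterrows() and appending each row's Review.
text = data["Review"].tolist()
print(text)  # -> ['great park', 'too crowded']
```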

I’m trying to get the sentiment for all the rows:

sent = []

for i in range(len(data)):
    sentiment = classifier(data.iloc[i,0])
    sent.append(sentiment)

The error is:

Token indices sequence length is longer than the specified maximum sequence length for this model (651 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-19-4bb136563e7c> in <module>()
      2 
      3 for i in range(len(data)):
----> 4     sentiment = classifier(data.iloc[i,0])
      5     sent.append(sentiment)

11 frames
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1914         # remove once script supports set_grad_enabled
   1915         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1916     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1917 
   1918 

IndexError: index out of range in self
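The IndexError comes from inputs longer than the model’s 512-token limit overflowing its position embeddings. One way to locate the offending rows before classification is to count tokens per text. This is a hedged sketch: `count_tokens` is an assumed callable, which with the tokenizer from the question could be `lambda t: len(token.encode(t))`.

```python
def find_long_rows(texts, count_tokens, max_len=512):
    """Return (index, token_count) pairs for texts exceeding max_len tokens."""
    offenders = []
    for i, t in enumerate(texts):
        n = count_tokens(t)
        if n > max_len:
            offenders.append((i, n))
    return offenders

# Demo with a stand-in token counter (whitespace split); with the real
# tokenizer you would pass lambda t: len(token.encode(t)) instead.
demo = ["short text", " ".join(["w"] * 600)]
print(find_long_rows(demo, lambda t: len(t.split())))  # -> [(1, 600)]
```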

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

2 reactions
LysandreJik commented, Apr 5, 2021

Could you specify truncation=True when calling the pipeline with your data?

That is, replacing classifier("My name is mark") with classifier("My name is mark", truncation=True).
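Applied to the loop from the question, the suggestion might look like the sketch below. The stub classifier here only illustrates the call shape; in the real code `classifier` is the Hugging Face sentiment pipeline, and `truncation=True` makes the tokenizer cut inputs down to the model’s 512-token limit instead of failing downstream.

```python
def classify_all(texts, classifier):
    """Classify each text, truncating anything beyond the model's max length."""
    results = []
    for t in texts:
        # truncation=True caps the tokenized input at the model's max length.
        results.append(classifier(t, truncation=True))
    return results

# Stub standing in for the sentiment pipeline, for illustration only.
def stub_classifier(text, truncation=False):
    return [{"label": "POSITIVE", "score": 0.99}]

print(classify_all(["My name is mark"], stub_classifier))
```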

0 reactions
Abe410 commented, Jun 16, 2022

Thanks so much for your help! I also dug in a bit further. It seems the RoBERTa model I was using can only handle about 286 words per input? (I used an example text and cut it down until it ran.) It might be easiest to pre-process the data first rather than relying on truncation within the classifier.
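Pre-processing along those lines could be sketched as follows: truncate each text to a token budget before classification. The `encode`/`decode` callables are assumptions standing in for the question’s tokenizer methods (`token.encode` / `token.decode`); the whitespace versions below are for illustration only.

```python
def truncate_to_tokens(text, encode, decode, max_tokens=510):
    """Cut text to at most max_tokens tokens (leaving room for special tokens)."""
    ids = encode(text)
    if len(ids) <= max_tokens:
        return text
    return decode(ids[:max_tokens])

# Stand-in whitespace "tokenizer"; swap in token.encode / token.decode
# from the question for real use.
enc = lambda t: t.split()
dec = lambda ids: " ".join(ids)

long_review = " ".join(["word"] * 600)
short = truncate_to_tokens(long_review, enc, dec, max_tokens=510)
print(len(enc(short)))  # -> 510
```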

Actually, you can train your custom model on top of pre-trained models if you have content and its respective class. That makes the model much accurate. I have a bert code if you want I can give it to you.

Hey

I am working on exactly the same problem as well. Does it really make the model more accurate?

Mind sharing the code with me as well? Thanks


