Any suggestions to handle longer text?
I'm trying to do predictions with the pre-trained model and I keep running into the following issue:
Token indices sequence length is longer than the specified maximum sequence length for this model (1142 > 512). Running this sequence through the model will result in indexing errors
*** RuntimeError: The size of tensor a (1142) must match the size of tensor b (512) at non-singleton dimension 1
The issue occurs whenever I try to predict on a text longer than 512 tokens. I understand this is because the string is too long, but other than chopping off the string, are there any suggestions on how to deal with this problem within the package?
Thank you
Issue Analytics
- Created: a year ago
- Comments: 5
Hello! This package is not really designed for long-form text, and the transformer models used (e.g. BERT, RoBERTa) have a max sequence length of 512 tokens. To get around this, one option would be to split your text into chunks, feed those to the model, and then average the results. Would that work for your case?

Just a suggestion: taking the max over the splits (perhaps breaking at sentence boundaries) would likely be better than averaging. The model tends to work as a detector, so finding objectionable content in any part should disqualify the whole document.
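The chunk-and-aggregate workaround described above can be sketched as follows. This is a minimal illustration, not part of the package: `score_chunk` is a hypothetical stand-in for whatever call returns a toxicity score for one chunk of token ids (in practice you would tokenize with the model's tokenizer and run each chunk through the model). Overlapping windows reduce the chance of splitting objectionable content across a chunk boundary.

```python
def chunk_ids(token_ids, max_len=512, stride=256):
    """Split a long token-id sequence into overlapping windows of max_len."""
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += stride  # overlap of (max_len - stride) tokens between windows
    return chunks

def score_long_text(token_ids, score_chunk, max_len=512, aggregate=max):
    """Score each window and aggregate. `max` flags any toxic span,
    whereas averaging can dilute a single bad chunk in a long document."""
    scores = [score_chunk(chunk) for chunk in chunk_ids(token_ids, max_len)]
    return aggregate(scores)

# Demo with a stub scorer: pretend token id 999 marks objectionable content.
fake_ids = list(range(1000)) + [999] + list(range(200))  # 1201 "tokens"
stub = lambda chunk: 1.0 if 999 in chunk else 0.1
print(score_long_text(fake_ids, stub))  # → 1.0 (max surfaces the hit)
```

With mean aggregation the same input would score only 0.55 here, which illustrates the point above: if the model works as a detector, max over chunks is the safer choice.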