Any suggestions to handle longer text?
I'm trying to do predictions with the pre-trained model and I keep running into the following issue:
Token indices sequence length is longer than the specified maximum sequence length for this model (1142 > 512). Running this sequence through the model will result in indexing errors
*** RuntimeError: The size of tensor a (1142) must match the size of tensor b (512) at non-singleton dimension 1
The issue occurs whenever I try to predict on a text longer than 512 tokens. I understand this is because the string is too long, but other than chopping off the string, are there any suggestions on how to deal with this problem within the package?
Thank you
Issue Analytics
- Created: a year ago
- Comments: 5
Hello! This package is not really designed for long-form text, and the transformer models used (e.g. BERT, RoBERTa) have a max sequence length of 512 tokens. To get around this, one option would be to split your text into chunks, feed those to the model, and then average the results. Would that work for your case?

Just a suggestion: taking the max over the splits (perhaps breaking at sentence boundaries) would likely be better than averaging. The model tends to work as a detector, so finding objectionable content in any part should disqualify the whole document.
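The chunk-and-aggregate workaround described above can be sketched as follows. This is a minimal illustration, not part of the package: `score_chunk` is a hypothetical stand-in for whatever call returns a toxicity score for one chunk of token ids (in practice you would tokenize with the model's tokenizer and run each chunk through the model). Overlapping windows reduce the chance of splitting objectionable content across a chunk boundary.

```python
def chunk_ids(token_ids, max_len=512, stride=256):
    """Split a long token-id sequence into overlapping windows of max_len."""
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += stride  # overlap of (max_len - stride) tokens between windows
    return chunks

def score_long_text(token_ids, score_chunk, max_len=512, aggregate=max):
    """Score each window and aggregate. `max` flags any toxic span,
    whereas averaging can dilute a single bad chunk in a long document."""
    scores = [score_chunk(chunk) for chunk in chunk_ids(token_ids, max_len)]
    return aggregate(scores)

# Demo with a stub scorer: pretend token id 999 marks objectionable content.
fake_ids = list(range(1000)) + [999] + list(range(200))  # 1201 "tokens"
stub = lambda chunk: 1.0 if 999 in chunk else 0.1
print(score_long_text(fake_ids, stub))  # → 1.0 (max surfaces the hit)
```

With mean aggregation the same input would score only 0.55 here, which illustrates the point above: if the model works as a detector, max over chunks is the safer choice.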