Possibility to process long documents?
I tried to set max_seq_length to over 512 and got an error (see below). I think this is because the original BERT model was trained with a maximum of 512 tokens?
Is there a way to process longer documents with the current implementation, or with a simple adaptation (e.g., sharing model weights across the chunks of a long document)?
I am thinking about adapting the code for long documents; if you have any suggestions on the implementation, please let me know. Many thanks!
The error produced when max_seq_length is set to 513:
Traceback (most recent call last):
  File "xxx.py", line 98, in <module>
    model.train_model(train_df)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/simpletransformers/classification/multi_label_classification_model.py", line 127, in train_model
    args=args,
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 262, in train_model
    **kwargs,
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 352, in train
    outputs = model(**inputs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/simpletransformers/custom_models/models.py", line 47, in forward
    head_mask=head_mask,
  File "xxx/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/transformers/modeling_bert.py", line 799, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/transformers/modeling_bert.py", line 195, in forward
    embeddings = self.dropout(embeddings)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/functional.py", line 749, in dropout
    else _VF.dropout(input, p, training))
RuntimeError: CUDA error: device-side assert triggered
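For context on the error itself: BERT-base learns an absolute position-embedding table with max_position_embeddings = 512 rows, so token position 512 is an out-of-range index into that table. On GPU this surfaces as the opaque "device-side assert triggered" above; on CPU the same lookup fails with a readable IndexError. A minimal reproduction sketch (assuming PyTorch; this is not the actual simpletransformers code path, just the same lookup):

```python
import torch
import torch.nn as nn

# BERT-base has embeddings only for positions 0..511
# (num_embeddings=512, hidden size 768, matching bert-base config).
position_embeddings = nn.Embedding(num_embeddings=512, embedding_dim=768)

ok = position_embeddings(torch.arange(512))  # positions 0..511: fine
print(ok.shape)                              # torch.Size([512, 768])

try:
    position_embeddings(torch.tensor([512]))  # position 512: out of range
except IndexError as exc:
    # On CUDA, this same out-of-range lookup triggers the
    # "device-side assert" instead of a Python exception.
    print("IndexError:", exc)
```

So raising max_seq_length alone cannot work without also resizing (and retraining or interpolating) the position-embedding table.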
Best wishes, A
Issue Analytics
- State:
- Created 4 years ago
- Comments:10 (7 by maintainers)
Top GitHub Comments
Yes, BERT won’t let you go over 512 tokens. The sliding_window feature is intended to help with this. It works by splitting longer documents into “windows” to keep everything under the length limit. It’s not going to work well in all cases, but you can try it out and see.

Have a look at the sliding_window feature in the README.
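The idea behind that feature can be sketched in plain Python: split the token sequence into overlapping windows, each at most max_seq_length tokens, then score each window and aggregate the per-window predictions (e.g., by averaging). This is an illustrative sketch of the technique, not the library's actual code; the stride fraction and window handling here are assumptions:

```python
def sliding_window(tokens, max_len=512, stride=0.8):
    """Split `tokens` into overlapping windows of at most `max_len` items.

    `stride` is the fraction of `max_len` to advance between window
    starts, so stride=0.8 gives roughly 20% overlap between windows.
    """
    step = max(1, int(max_len * stride))
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
    return windows

# A 1000-"token" document with max_len=512 yields 3 overlapping windows,
# starting at positions 0, 409, and 818.
chunks = sliding_window(list(range(1000)), max_len=512, stride=0.8)
print([len(w) for w in chunks])
```

In practice, each window also needs the special tokens ([CLS]/[SEP]) re-added before being fed to the model, and a document-level prediction is obtained by combining the window-level outputs. The overlap exists so that text falling near a window boundary still appears with some surrounding context in the neighboring window.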