
Possibility to process long documents?

See original GitHub issue

I tried to set max_seq_length to a value over 512 and got an error (see the error info below). I think this is because the original BERT model was trained with a maximum of 512 tokens?
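[Editor's note: the guess above is essentially right. A minimal, illustrative sketch of the failure mode follows; the code below is not the library's implementation, just a stand-in showing why the opaque CUDA assert appears.]

```python
# Illustrative sketch of why max_seq_length > 512 fails: bert-base has a
# learned position-embedding table with exactly 512 rows
# (max_position_embeddings=512), so the 513th token needs position id 512,
# which indexes past the end of the table. On CPU this is a plain index
# error; on GPU the same out-of-range lookup surfaces as the cryptic
# "CUDA error: device-side assert triggered" in the traceback below.
MAX_POSITION_EMBEDDINGS = 512  # bert-base config value

# stand-in for the real embedding matrix: 512 rows of (tiny) vectors
position_table = [[0.0] * 4 for _ in range(MAX_POSITION_EMBEDDINGS)]

def lookup_position(position_id):
    """Look up the embedding row for one position id."""
    return position_table[position_id]

lookup_position(511)        # position of the 512th token: valid
try:
    lookup_position(512)    # position of the 513th token: out of range
except IndexError:
    print("position id 512 is out of range")
```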

Is there a way to process longer documents with the current implementation, or via a simple adaptation (e.g. shared model weights across the chunks of a long document)?

I am thinking about adapting the code for long documents; if you have any suggestions on the implementation, please let me know. Many thanks!
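[Editor's note: the shared-weights adaptation suggested above is commonly done as chunk-then-pool. The sketch below is purely illustrative, with hypothetical names and a fake per-chunk scorer in place of a real encoder.]

```python
# Hypothetical sketch of the suggested adaptation: split the document into
# chunks, run every chunk through the same encoder (so weights are shared
# by construction), then pool the per-chunk class scores into one
# document-level prediction. None of this is simpletransformers code.
def split_into_chunks(token_ids, chunk_len=512):
    """Split a token-id list into consecutive chunks of at most chunk_len."""
    return [token_ids[i:i + chunk_len] for i in range(0, len(token_ids), chunk_len)]

def mean_pool(chunk_logits):
    """Average per-chunk class scores into one document-level score vector."""
    n = len(chunk_logits)
    num_classes = len(chunk_logits[0])
    return [sum(logits[c] for logits in chunk_logits) / n for c in range(num_classes)]

# toy example: a 1300-token document, 3 classes, fake per-chunk scores
doc = list(range(1300))
chunks = split_into_chunks(doc)      # 3 chunks: 512 + 512 + 276 tokens
fake_logits = [[0.2, 0.5, 0.3], [0.4, 0.1, 0.5], [0.0, 0.9, 0.1]]
doc_scores = mean_pool(fake_logits)  # approximately [0.2, 0.5, 0.3]
```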


The error when setting max_seq_length to 513 is provided below:

Traceback (most recent call last):
  File "xxx.py", line 98, in <module>
    model.train_model(train_df)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/simpletransformers/classification/multi_label_classification_model.py", line 127, in train_model
    args=args,
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 262, in train_model
    **kwargs,
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 352, in train
    outputs = model(**inputs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/simpletransformers/custom_models/models.py", line 47, in forward
    head_mask=head_mask,
  File "xxx/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/transformers/modeling_bert.py", line 799, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/transformers/modeling_bert.py", line 195, in forward
    embeddings = self.dropout(embeddings)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "xxx/anaconda/envs/pt100/lib/python3.6/site-packages/torch/nn/functional.py", line 749, in dropout
    else _VF.dropout(input, p, training))
RuntimeError: CUDA error: device-side assert triggered

Best wishes, A

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

3 reactions
ThilinaRajapakse commented, Feb 22, 2020

Yes, BERT won’t let you go over 512 tokens. sliding_window is intended to help with this. It works by splitting longer documents into overlapping “windows” that each stay under the length limit. It’s not going to work well in all cases, but you can try it out and see.
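[Editor's note: the idea behind sliding_window can be sketched as below. This is illustrative code, not the library's internals; the function name and the stride-as-fraction convention are assumptions for the example.]

```python
# Advance through the token sequence by a stride smaller than the window,
# so consecutive windows overlap and no window exceeds the 512-token limit.
# Per-window predictions are then combined (e.g. by majority vote or
# averaging) into a single document-level label.
def sliding_windows(token_ids, max_len=512, stride_frac=0.8):
    """Return overlapping windows of at most max_len tokens.

    stride_frac is the fraction of max_len to advance between windows
    (0.8 means consecutive windows overlap by about 20%).
    """
    step = max(1, int(max_len * stride_frac))
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
    return windows

windows = sliding_windows(list(range(1200)))
# 3 windows, each at most 512 tokens, jointly covering all 1200 positions
```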

3 reactions
kinoute commented, Feb 22, 2020

Have a look at the sliding_window feature in the README.


Top Results From Across the Web

NLP Trend: A few examples on how to process long documents
It is clear that being able to process and extract key information from a database of large documents is extremely valuable for a...

Efficient Classification of Long Documents ... - ACL Anthology
Several methods have been proposed for classifying long textual documents using Transformers. However, there is a lack of consensus.

A scalable Transformer architecture for summarizing long ...
Our objectives are to design a Transformer-based model that is capable of processing long documents; to make an architecture that can take ...

Using BERT For Classifying Documents with Long Texts
How to fine-tune BERT for inputs longer than a few words or sentences ...

Process Documentation: Definition & Best Practices - Helpjuice
Process documentation maps out an ideal way of completing a workflow in your business. Here's everything you need to know and how to...
