
Impact of "shorter" documents (span, number of tokens) for extended pretraining


I am currently trying to use DeCLUTR for extended pretraining in a multilingual setting on domain-specific documents.

I chose to use sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and have approximately 100k documents of varying length. Thus, I quickly ran into the “known” token/span length error (see https://github.com/JohnGiorgi/DeCLUTR/blob/73a19313cd1707ce6a7a678451f41a3091205d4e/declutr/common/contrastive_utils.py#L48)

Since I cannot change the data I have, I tried adjusting the span lengths, namely max_length and min_length in the config declutr.jsonnet (everything else remained as is), and filtered my dataset to meet the resulting minimum token length, which does work in this particular setting. The result is that I have ~10k documents left. Is that enough for extended pretraining on a specific domain?

I ended up using min_length = 8 and max_length = 32 with a minimum of 128 tokens per document (this follows from those settings, since each document must be at least 4 × max_length tokens long in the DeCLUTR setup).
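
For reference, here is how that 128-token minimum falls out of the chosen settings. This is only a back-of-the-envelope sketch of the constraint, not DeCLUTR's actual code (the real assertion lives in contrastive_utils.py, linked above):

```python
# Sketch of the minimum-document-length constraint discussed above:
# each of the num_anchors anchor spans needs room for itself plus an
# adjacent positive span, each of up to max_length tokens.

num_anchors = 2    # DeCLUTR default, as confirmed in the comments below
min_length = 8     # minimum span length chosen here
max_length = 32    # maximum span length chosen here

min_tokens_per_document = num_anchors * 2 * max_length
print(min_tokens_per_document)  # 2 * 2 * 32 = 128
```

In other words, any document shorter than 128 tokens cannot host two anchors of up to 32 tokens plus their positives, which is exactly the situation that triggers the assertion error mentioned above.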

My question is: does this make sense (reducing the min/max span lengths and using only the "longer" documents), or how else can I approach this when my documents are not as long as those in, for example, the WikiText-103 setting?

Are there maybe some hints or "rules of thumb" I can follow?

Thanks a lot for your help!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
repodiac commented, Sep 28, 2021

OK, but you agree that you require documents for training to be longer (with 2 anchors, at least 4 × max_length) than what you actually support for inference! This might be a serious issue for practical use, at least in my case.

> > (I think you are saying 4 because num_anchors == 2 by default?)
>
> yes
>
> This doesn’t mean the model actually sees text of this length. It only ever sees text from token length min_length up to token length max_length. I hope that is clear. I would encourage you to check out our paper for more details, but also feel free to ask follow-up questions.

Have only skimmed the paper, to be honest 😃
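
To make the quoted point concrete, here is a purely illustrative sketch of sampling a training span whose length lies between min_length and max_length tokens. It is not the actual sampling code in contrastive_utils.py, and the uniform sampling is just for the example:

```python
import random

def sample_span(tokens, min_length=8, max_length=32):
    """Illustrative only: return a contiguous span of between
    min_length and max_length tokens from a (long enough) document."""
    span_len = random.randint(min_length, min(max_length, len(tokens)))
    start = random.randint(0, len(tokens) - span_len)
    return tokens[start:start + span_len]

tokens = ("lorem ipsum dolor sit amet " * 50).split()  # stand-in document
anchor = sample_span(tokens)
print(len(anchor))  # somewhere between 8 and 32 tokens; the encoder only
                    # ever trains on spans like this, never the full document
```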

> Again, I would try plotting the token length of your documents. This would give you a better sense of whether or not DeCLUTR is suitable.

Ok, will do.
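
For what it is worth, a plot like the one suggested above could be produced along these lines (a sketch assuming the corpus is a plain-text file with one document per line; the file name is a placeholder, and any tokenizer/plotting library would do):

```python
from transformers import AutoTokenizer
import matplotlib.pyplot as plt

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)

# "corpus.txt" is a placeholder: one document per line.
with open("corpus.txt", encoding="utf-8") as f:
    lengths = [
        len(tokenizer(line, add_special_tokens=False)["input_ids"])
        for line in f
        if line.strip()
    ]

plt.hist(lengths, bins=50)
plt.axvline(128, color="red")  # the 4 x max_length minimum derived above
plt.xlabel("tokens per document")
plt.ylabel("number of documents")
plt.show()
```

Overlaying the 128-token threshold makes it easy to see how much of the corpus would survive the filter.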

> I would also check out the training notebook and the preprocess_wikitext_103.py script if you have not. They demonstrate the process of calculating min_length and then filtering WikiText-103 by it to produce a subsetted corpus of 17,824 documents.

I have analyzed exactly that script, just to see how much preprocessing is required, which fortunately is not much. WikiText documents are "huge", though… there is no way my data reaches similar lengths.
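
A rough analogue of the filtering step that the preprocessing script demonstrates might look like this (the file names and the one-document-per-line format are assumptions for the sketch, not what preprocess_wikitext_103.py actually uses):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)
min_tokens = 128  # num_anchors * 2 * max_length, as derived above

kept = 0
with open("corpus.txt", encoding="utf-8") as fin, \
     open("corpus_filtered.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        n = len(tokenizer(line, add_special_tokens=False)["input_ids"])
        if n >= min_tokens:
            fout.write(line)
            kept += 1
print(f"kept {kept} documents of at least {min_tokens} tokens")
```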

> Finally, there is also a whole family of sentence embedding models here that might be worth checking out.

Thanks, I am already using a Sentence Transformers model as the base for the extension, as I wrote: sentence-transformers/paraphrase-multilingual-mpnet-base-v2.

0 reactions
JohnGiorgi commented, Oct 7, 2021

Closing, feel free to re-open if you are still having issues.


