Impact of "shorter" documents (span, number of tokens) for extended pretraining
I am currently trying to use DeCLUTR for extended pretraining in a multilingual setting, on documents from a specific domain. I chose `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` and have approximately 100k documents of varying length. I therefore quickly ran into the "known" token/span length error (see https://github.com/JohnGiorgi/DeCLUTR/blob/73a19313cd1707ce6a7a678451f41a3091205d4e/declutr/common/contrastive_utils.py#L48).
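For context, the error comes from a sanity check on the number of tokens in each document before spans are sampled. The check is roughly of the following form (a simplified sketch, not the verbatim library code; parameter names follow the DeCLUTR config):

```python
def check_document_length(text: str, num_anchors: int, max_span_len: int) -> None:
    """Simplified sketch of the span-sampling precondition in contrastive_utils.py."""
    num_tokens = len(text.split())  # spans are sampled over (roughly) whitespace tokens
    # Each anchor needs room for itself and its positives, hence the factor of 2.
    # With num_anchors=2 this works out to 2 * 2 * max_span_len = 4 * max_span_len tokens.
    if num_tokens < num_anchors * max_span_len * 2:
        raise ValueError(
            f"Document has {num_tokens} tokens, but at least "
            f"{num_anchors * max_span_len * 2} are required to sample spans."
        )
```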
Since I cannot change the data I have, I tried to adjust the span lengths, namely `max_length` and `min_length`, in the config `declutr.jsonnet` (everything else remained as is), and filtered my dataset to the resulting minimum token length, which does work in this particular setting. As a result I have ~10k documents left. Is that enough for extended pretraining on a specific domain?
I ended up using `min_length = 8` and `max_length = 32`, with a minimum of 128 tokens per document (this follows from those settings, since `max_length` is effectively multiplied by 4 in the DeCLUTR setup: 2 anchors × 2 × 32 = 128).
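To make the filtering step concrete, here is a minimal sketch of how one could drop too-short documents before training, assuming whitespace tokenization and the settings above (the threshold of 128 follows from 2 anchors × 2 × `max_length`):

```python
# Minimal filtering sketch; assumes one document per string in `documents`
# and whitespace tokenization, mirroring the length check above.
num_anchors = 2
max_length = 32
min_tokens = num_anchors * max_length * 2  # = 128 for these settings

def filter_documents(documents: list[str]) -> list[str]:
    """Keep only documents long enough for DeCLUTR span sampling."""
    return [doc for doc in documents if len(doc.split()) >= min_tokens]

# Example usage:
documents = ["too short", "token " * 200]
print(len(filter_documents(documents)))  # 1
```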
My question is: does this make sense (reducing the min/max span lengths and using only "longer" documents), or how else can I approach this when my documents are not as long as, for example, those in the WikiText-103 example setup?
Are there any hints or rules of thumb I can follow?
Thanks a lot for your help!
OK, but you agree that you require documents for training to be longer (with 2 anchors, at least 4 × `max_length`) than what you actually support at inference! This might be a serious issue for practical use, at least in my case.

Yes.
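To make the asymmetry concrete: inference has no such minimum, so the trained encoder happily embeds texts far shorter than the 4 × `max_length` tokens required per training document. A sketch using the sentence-transformers API:

```python
from sentence_transformers import SentenceTransformer

# At inference time there is no minimum document length:
# even a single short sentence can be embedded.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
embedding = model.encode("A short query.")
print(embedding.shape)  # (768,) for this model
```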
Have only skimmed the paper, to be honest 😃
Ok, will do.
I have analyzed this script, exactly, just to see how much preprocessing is required, which fortunately is not much. The WikiText documents are "huge"… there is no way my data reaches similar lengths.
Thanks, I am already using a Sentence Transformers model for the extension, as I wrote: `sentence-transformers/paraphrase-multilingual-mpnet-base-v2`.
Closing, feel free to re-open if you are still having issues.