Finetuning BART on CNN/DM: Why not truncate long articles?
See original GitHub issue.

In the README for finetuning BART on CNN/DM, the command for finetuning BART does not specify --truncate-source.

By running the README command, I get the following warning:

WARNING: 86040 samples have invalid sizes and will be skipped […]

But if I specify the option --truncate-source, the warning becomes:

WARNING: 5 samples have invalid sizes and will be skipped […]

Why is the option --truncate-source not used? It feels like skipping 80k samples is detrimental to performance, and other existing architectures always truncate the article if it is too long…
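For context, the difference between the two warnings comes down to filtering versus truncating over-long sources. The following is a minimal sketch of that behavior, not fairseq's actual implementation; the constant and function name are illustrative assumptions.

```python
# Minimal sketch (not fairseq's actual code) of filtering vs. truncating
# over-long source samples. Names and values are illustrative assumptions.
MAX_SOURCE_POSITIONS = 1024  # BART's maximum source length

def filter_or_truncate(token_ids, truncate_source=False):
    """Return the ids if they fit, a truncated copy if requested,
    or None to signal that the sample will be skipped."""
    if len(token_ids) <= MAX_SOURCE_POSITIONS:
        return token_ids
    if truncate_source:
        # Keep the leading tokens and preserve the final EOS token.
        return token_ids[:MAX_SOURCE_POSITIONS - 1] + token_ids[-1:]
    # Without truncation, the over-long sample is dropped, which is what
    # produces the "invalid sizes and will be skipped" warning.
    return None
```

With truncation enabled, only otherwise-malformed samples would still be skipped, which would be consistent with the warning count dropping from 86040 to 5 above.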
Issue Analytics
- Created 4 years ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@Colanim Yes, we truncated the long sources; in fact, that's the very reason I introduced the --truncate-source arg in the code. It's just a missing arg in the README, will add the fix. Thanks for pointing it out @Colanim

@astariul thanks for pointing this out. Apparently Naman changed the code when he pushed it, compared to what I had when I fine-tuned CNN/DM. But Naman is on vacation now, so I can only tell you what I did to get the numbers in the paper. I truncated all the sources to 1024 - 4 tokens, and then made each into BOS truncated_source EOS MASK EOS. When we tried to release the code, Naman tried removing MASK and EOS, and it turned out not to make a difference. Either way, we read all the instances from the data and truncated the longer ones to fit into 1024. The numbers in the paper used all the instances, without filtering. I will probably have time later next week to take a look at this code and see where the truncation happens without --truncate-source. More likely, we forgot to set --truncate-source to True in the default setting. Thanks for pointing this out to me.
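For illustration, the input formatting described in that comment could look roughly like the sketch below. The special-token IDs and function name are assumptions for this sketch, not the actual fairseq or paper code.

```python
# Hedged sketch of the formatting described above: the source is truncated
# to 1024 - 4 tokens so the 4 special tokens still fit within BART's
# 1024-position limit. Token IDs are placeholders for illustration only.
BOS, EOS, MASK = 0, 2, 51200   # assumed IDs, not fairseq's real values
MAX_POSITIONS = 1024

def build_paper_style_source(source_ids):
    budget = MAX_POSITIONS - 4   # reserve room for BOS, EOS, MASK, EOS
    truncated = source_ids[:budget]
    # BOS truncated_source EOS MASK EOS, as described in the comment above
    return [BOS] + truncated + [EOS, MASK, EOS]
```

Per the comment, dropping the trailing MASK and EOS at release time made no measurable difference, so the simpler BOS source EOS form behaves equivalently.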