Finetuning BART on CNN/DM: Why not truncate long articles?
See original GitHub issue.

In the README for finetuning BART on CNN/DM, the command for finetuning BART does not specify --truncate-source.

By running the README command, I get the following warning:

WARNING: 86040 samples have invalid sizes and will be skipped […]

But if I specify the option --truncate-source, the warning becomes:

WARNING: 5 samples have invalid sizes and will be skipped […]

Why is the option --truncate-source not used? It feels like skipping 80k samples is detrimental to performance, and other existing architectures always truncate the article if it is too long…
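For context, the difference between the two warnings comes down to filtering versus truncating over-long sources. The following is a minimal sketch of that behavior, not fairseq's actual implementation; the constant and function name are illustrative assumptions.

```python
# Minimal sketch (not fairseq's actual code) of filtering vs. truncating
# over-long source samples. Names and values are illustrative assumptions.
MAX_SOURCE_POSITIONS = 1024  # BART's maximum source length

def filter_or_truncate(token_ids, truncate_source=False):
    """Return the ids if they fit, a truncated copy if requested,
    or None to signal that the sample will be skipped."""
    if len(token_ids) <= MAX_SOURCE_POSITIONS:
        return token_ids
    if truncate_source:
        # Keep the leading tokens and preserve the final EOS token.
        return token_ids[:MAX_SOURCE_POSITIONS - 1] + token_ids[-1:]
    # Without truncation, the over-long sample is dropped, which is what
    # produces the "invalid sizes and will be skipped" warning.
    return None
```

With truncation enabled, only otherwise-malformed samples would still be skipped, which would be consistent with the warning count dropping from 86040 to 5 above.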
Issue Analytics
- Created 4 years ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@Colanim Yes, we truncated the long sources; in fact, that's the very reason I introduced the --truncate-source arg in the code. It's just a missing arg in the README, will add the fix. Thanks for pointing it out @Colanim

@astariul thanks for pointing this out. Apparently Naman changed the code when he pushed it, compared to what I had when I fine-tuned CNN/DM. But Naman is on vacation now, so I can only tell you what I did to get the numbers in the paper. I truncated all the sources to 1024 - 4 tokens, and then made each into BOS truncated_source EOS MASK EOS. When we tried to release the code, Naman tried removing MASK and EOS, and it turned out not to make a difference. Either way, we read all the instances from the data and truncated the longer ones to fit into 1024. The numbers in the paper used all the instances, without filtering. I will probably have time later next week to take a look at this code and see where the truncation happens without --truncate-source. More likely, we forgot to set --truncate-source to True in the default setting. Thanks for pointing this out to me.
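For illustration, the input formatting described in that comment could look roughly like the sketch below. The special-token IDs and function name are assumptions for this sketch, not the actual fairseq or paper code.

```python
# Hedged sketch of the formatting described above: the source is truncated
# to 1024 - 4 tokens so the 4 special tokens still fit within BART's
# 1024-position limit. Token IDs are placeholders for illustration only.
BOS, EOS, MASK = 0, 2, 51200   # assumed IDs, not fairseq's real values
MAX_POSITIONS = 1024

def build_paper_style_source(source_ids):
    budget = MAX_POSITIONS - 4   # reserve room for BOS, EOS, MASK, EOS
    truncated = source_ids[:budget]
    # BOS truncated_source EOS MASK EOS, as described in the comment above
    return [BOS] + truncated + [EOS, MASK, EOS]
```

Per the comment, dropping the trailing MASK and EOS at release time made no measurable difference, so the simpler BOS source EOS form behaves equivalently.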