
Finetuning BART on CNN/DM: Why not truncate long articles?

See original GitHub issue

In the README for finetuning BART on CNN/DM, the training command does not specify --truncate-source.

When I run the README command as-is, I get the following warning:

WARNING: 86040 samples have invalid sizes and will be skipped […]

But if I add the --truncate-source option, the warning becomes:

WARNING: 5 samples have invalid sizes and will be skipped […]


Why is the --truncate-source option not used?
It feels like skipping 80k samples is detrimental to performance, and other existing architectures always truncate articles that are too long…

@ngoyal2707 @yinhanliu
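
For context, here is a minimal sketch of the difference between the two behaviors the warnings show. This is illustrative only, not fairseq's actual code; the helper name and the 1024 limit (BART's encoder size) are assumptions:

```python
MAX_SOURCE_POSITIONS = 1024  # BART's encoder limit (assumed here)

def filter_or_truncate(samples, truncate_source=False):
    """Mimic the observable effect of fairseq's --truncate-source flag."""
    kept, skipped = [], 0
    for src_tokens in samples:
        if len(src_tokens) > MAX_SOURCE_POSITIONS:
            if truncate_source:
                # Keep the sample, cut it to the maximum length.
                src_tokens = src_tokens[:MAX_SOURCE_POSITIONS]
            else:
                # "invalid size" -> the sample is dropped entirely.
                skipped += 1
                continue
        kept.append(src_tokens)
    return kept, skipped
```

Without the flag, every article longer than the limit counts toward the "invalid sizes" warning; with it, only a handful of samples remain invalid, matching the two warnings quoted above.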

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
ngoyal2707 commented, Dec 2, 2019

@Colanim Yes, we truncated the long sources; in fact, that’s the very reason I introduced the --truncate-source arg in the code. It’s just a missing arg in the README; I will add the fix. Thanks for pointing it out, @Colanim.
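
For reference, a hedged paraphrase of what --truncate-source does when fairseq's translation task loads the source side (reconstructed from memory of fairseq/tasks/translation.py; treat the exact wrapping order and signatures as approximate):

```python
from fairseq.data import AppendTokenDataset, StripTokenDataset, TruncateDataset

def maybe_truncate_source(src_dataset, src_dict, max_source_positions, truncate_source):
    if truncate_source:
        # Strip the trailing EOS, truncate the source to max_source_positions - 1
        # tokens, then re-append EOS so every sample fits the encoder.
        src_dataset = AppendTokenDataset(
            TruncateDataset(
                StripTokenDataset(src_dataset, src_dict.eos()),
                max_source_positions - 1,
            ),
            src_dict.eos(),
        )
    return src_dataset
```

With this wrapping, no source ever exceeds max_source_positions, so the length filter no longer drops the 86k over-long samples.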

1 reaction
yinhanliu commented, Nov 23, 2019

@astariul Thanks for pointing this out. Apparently Naman changed the code when he pushed it, so it differs from what I had when I fine-tuned on CNN/DM, but Naman is on vacation now, so I can only describe what I did to get the numbers in the paper. I truncated every source to 1024 - 4 tokens and then formatted it as BOS truncated_source EOS MASK EOS. When we prepared the code release, Naman tried removing the MASK and EOS, and it turned out not to make a difference. Either way, we read all the instances from the data and truncated the longer ones to fit into 1024; the numbers in the paper used all the instances, without filtering. I will probably have time later next week to look at the code and see where truncation happens without --truncate-source. More likely, we forgot to set --truncate-source to True by default. Thanks for pointing this out to me.
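
A minimal sketch of the input layout Yinhan describes: truncate the raw source to 1024 - 4 tokens, then wrap it as BOS … EOS MASK EOS. The helper and its token arguments are illustrative, not fairseq code:

```python
MAX_POSITIONS = 1024
RESERVED = 4  # room for BOS, EOS, MASK, EOS

def build_source(tokens, bos, eos, mask):
    """Illustrative helper: lay out a source as BOS tokens EOS MASK EOS."""
    tokens = tokens[: MAX_POSITIONS - RESERVED]  # "1024 - 4 tokens"
    return [bos] + tokens + [eos, mask, eos]     # BOS src EOS MASK EOS
```

Per the comment, the released code drops the trailing MASK and EOS, which turned out not to affect results.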

Read more comments on GitHub
