Difficulties to reproduce CNN/DM results with BART

Thanks for the great effort toward releasing BART 😃

I’m currently having some difficulties reproducing BART results on the CNN/DM dataset.

I followed the README to test the bart.large.cnn model.

I obtained the following results:

1 ROUGE-1 Average_R: 0.50475 (95%-conf.int. 0.50197 - 0.50743)
1 ROUGE-1 Average_P: 0.39349 (95%-conf.int. 0.39102 - 0.39598)
1 ROUGE-1 Average_F: 0.43093 (95%-conf.int. 0.42857 - 0.43327)

1 ROUGE-2 Average_R: 0.23609 (95%-conf.int. 0.23327 - 0.23884)
1 ROUGE-2 Average_P: 0.18497 (95%-conf.int. 0.18268 - 0.18712)
1 ROUGE-2 Average_F: 0.20195 (95%-conf.int. 0.19956 - 0.20425)

1 ROUGE-L Average_R: 0.46458 (95%-conf.int. 0.46175 - 0.46720)
1 ROUGE-L Average_P: 0.36244 (95%-conf.int. 0.35999 - 0.36480)
1 ROUGE-L Average_F: 0.39678 (95%-conf.int. 0.39445 - 0.39898)

This is more than 1 point lower than the expected output.

Any advice on how to reproduce the results is welcome.

@ngoyal2707 @yinhanliu


I personally think it comes from the dataset processing.
I think it’s honorable to try to use the script from abisee’s repository in order to have comparable results, but we still need to modify the script to make it work for BART, as mentioned here.
And these modifications are still obscure.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 15 (6 by maintainers)

Top GitHub Comments

5 reactions
astariul commented, Nov 21, 2019

With the changes pointed out by @ngoyal2707, here are my results:

1 ROUGE-1 Average_R: 0.51395 (95%-conf.int. 0.51137 - 0.51660)
1 ROUGE-1 Average_P: 0.40532 (95%-conf.int. 0.40290 - 0.40792)
1 ROUGE-1 Average_F: 0.44235 (95%-conf.int. 0.44009 - 0.44458)

1 ROUGE-2 Average_R: 0.24649 (95%-conf.int. 0.24376 - 0.24923)
1 ROUGE-2 Average_P: 0.19483 (95%-conf.int. 0.19248 - 0.19710)
1 ROUGE-2 Average_F: 0.21227 (95%-conf.int. 0.20990 - 0.21459)

1 ROUGE-L Average_R: 0.47664 (95%-conf.int. 0.47399 - 0.47927)
1 ROUGE-L Average_P: 0.37619 (95%-conf.int. 0.37373 - 0.37863)
1 ROUGE-L Average_F: 0.41043 (95%-conf.int. 0.40808 - 0.41257)

Results reproduced! 😄
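To put a number on the improvement, the F-scores of the two runs reported in this thread can be compared directly. The following is a minimal sketch, not part of the original issue: the helper names `parse_rouge` and `f_score_gains` are hypothetical, and the regular expression only assumes the line format shown in the ROUGE output above.

```python
import re

# Matches score lines such as:
#   1 ROUGE-1 Average_F: 0.43093 (95%-conf.int. 0.42857 - 0.43327)
ROUGE_LINE = re.compile(r"(ROUGE-\w+) Average_([RPF]): (\d+\.\d+)")


def parse_rouge(report: str) -> dict:
    """Return {(metric, kind): score} for every score found in the report text."""
    return {(metric, kind): float(value)
            for metric, kind, value in ROUGE_LINE.findall(report)}


def f_score_gains(before: str, after: str) -> dict:
    """Return the F-score gain per metric, in ROUGE points (score * 100)."""
    b, a = parse_rouge(before), parse_rouge(after)
    return {metric: (a[(metric, "F")] - b[(metric, "F")]) * 100
            for (metric, kind) in b if kind == "F"}


# F-score lines copied from the two runs reported in this thread.
BEFORE = """
1 ROUGE-1 Average_F: 0.43093 (95%-conf.int. 0.42857 - 0.43327)
1 ROUGE-2 Average_F: 0.20195 (95%-conf.int. 0.19956 - 0.20425)
1 ROUGE-L Average_F: 0.39678 (95%-conf.int. 0.39445 - 0.39898)
"""
AFTER = """
1 ROUGE-1 Average_F: 0.44235 (95%-conf.int. 0.44009 - 0.44458)
1 ROUGE-2 Average_F: 0.21227 (95%-conf.int. 0.20990 - 0.21459)
1 ROUGE-L Average_F: 0.41043 (95%-conf.int. 0.40808 - 0.41257)
"""

gains = f_score_gains(BEFORE, AFTER)
for metric, gain in gains.items():
    print(f"{metric}: +{gain:.2f} points")
```

On these numbers, every F-score improves by more than one point, which matches the size of the gap described in the original report.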

3 reactions
ngoyal2707 commented, Nov 20, 2019

@Colanim Thanks for your efforts on reproduction. Yes, we train models with cased data, so having cased input is important at test time.
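One quick way to catch a casing problem before running evaluation is to check whether the preprocessed source text still contains uppercase letters. The snippet below is only a rough sanity check under that assumption; `looks_lowercased` is a hypothetical helper and is not part of fairseq or of the preprocessing script discussed here.

```python
def looks_lowercased(text: str) -> bool:
    """Return True if the text contains letters and none of them is uppercase."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and not any(ch.isupper() for ch in letters)


# Example: a cased sentence passes, a lowercased one is flagged.
print(looks_lowercased("The president spoke in Paris on Monday."))
print(looks_lowercased("the president spoke in paris on monday."))
```

If this returns True for most lines of the test source file, the preprocessing step probably lowercased the data and the input should be regenerated with casing preserved.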

Some subtle differences that could explain the small gap you are seeing:

  1. Change here to not have space.

  2. We remove the ‘(CNN)’ keyword from the article by adding the following lines here:

if cnn and article[:5] == '(CNN)':
    article = article[5:]
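The two lines above are meant to be pasted into the preprocessing script, where `cnn` and `article` already exist. As a self-contained illustration, the same logic can be wrapped in a function; the name `strip_cnn_prefix` and its parameters are assumptions made for this sketch, not names from the script.

```python
CNN_PREFIX = "(CNN)"


def strip_cnn_prefix(article: str, is_cnn: bool) -> str:
    """Drop a leading '(CNN)' marker from articles that come from the CNN split.

    Only an exact prefix is removed; any text after it, including leading
    whitespace, is returned unchanged, as in the two-line version.
    """
    if is_cnn and article.startswith(CNN_PREFIX):
        return article[len(CNN_PREFIX):]
    return article


print(strip_cnn_prefix("(CNN)The storm made landfall on Friday.", is_cnn=True))
print(strip_cnn_prefix("The storm made landfall on Friday.", is_cnn=True))
```

Articles from the Daily Mail split would be passed with `is_cnn=False` and are returned untouched.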
