Difficulties reproducing CNN/DM results with BART
Thanks for the great effort toward releasing BART 😃
I’m currently having some difficulties reproducing BART results on the CNN/DM dataset.
I followed the README to test the bart.large.cnn model.
I obtained the following results:
1 ROUGE-1 Average_R: 0.50475 (95%-conf.int. 0.50197 - 0.50743)
1 ROUGE-1 Average_P: 0.39349 (95%-conf.int. 0.39102 - 0.39598)
1 ROUGE-1 Average_F: 0.43093 (95%-conf.int. 0.42857 - 0.43327)
1 ROUGE-2 Average_R: 0.23609 (95%-conf.int. 0.23327 - 0.23884)
1 ROUGE-2 Average_P: 0.18497 (95%-conf.int. 0.18268 - 0.18712)
1 ROUGE-2 Average_F: 0.20195 (95%-conf.int. 0.19956 - 0.20425)
1 ROUGE-L Average_R: 0.46458 (95%-conf.int. 0.46175 - 0.46720)
1 ROUGE-L Average_P: 0.36244 (95%-conf.int. 0.35999 - 0.36480)
1 ROUGE-L Average_F: 0.39678 (95%-conf.int. 0.39445 - 0.39898)
These scores are more than 1 ROUGE point lower than the expected results.
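For reference, the generation step I ran follows the README's hub interface; here is a minimal sketch of it (the test.source / test.hypo paths are placeholders for my local files):

```python
import torch

# Load the pre-trained CNN/DM checkpoint through torch.hub, as in the README.
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.cuda()
bart.eval()
bart.half()

def summarize(batch):
    # Beam-search settings taken from the README for CNN/DM.
    with torch.no_grad():
        return bart.sample(batch, beam=4, lenpen=2.0, max_len_b=140,
                           min_len=55, no_repeat_ngram_size=3)

batch, bsz = [], 32
with open('test.source') as source, open('test.hypo', 'w') as fout:
    for line in source:
        batch.append(line.strip())
        if len(batch) == bsz:
            fout.write('\n'.join(summarize(batch)) + '\n')
            batch = []
    if batch:
        fout.write('\n'.join(summarize(batch)) + '\n')
```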
Any advice on how to reproduce the results is welcome.
I personally think the gap comes from the dataset preprocessing.
I think it is preferable to use the script from abisee's repository so that results stay comparable, but we still need to modify that script to make it work for BART, as mentioned here.
And those modifications are still unclear.
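For what it's worth, here is a rough sketch of the kind of preprocessing I mean, under the assumption that BART expects one cased, untokenized article per line in a .source file and one summary per line in a .target file (the helper names and the joining strategy are my own guesses, not an official recipe):

```python
import os

def story_to_example(path):
    """Split an abisee-style .story file into (article, summary) strings."""
    article, highlights = [], []
    next_is_highlight = False
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line == '@highlight':
                next_is_highlight = True
            elif next_is_highlight:
                highlights.append(line)
                next_is_highlight = False
            else:
                article.append(line)
    # Keep the original casing: the released checkpoints were trained on cased text.
    return ' '.join(article), ' '.join(highlights)

def write_split(story_dir, out_prefix):
    """Write one article per line to <prefix>.source and one summary per line to <prefix>.target."""
    with open(out_prefix + '.source', 'w', encoding='utf-8') as src, \
         open(out_prefix + '.target', 'w', encoding='utf-8') as tgt:
        for name in sorted(os.listdir(story_dir)):
            article, summary = story_to_example(os.path.join(story_dir, name))
            src.write(article + '\n')
            tgt.write(summary + '\n')
```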
Top GitHub Comments
With the changes pointed out by @ngoyal2707, here are my results:
Results reproduced! 😄
@Colanim Thanks for your efforts on reproduction. Yes, we train the models on cased data, so having cased input is important at test time.
Some small, subtle differences that could explain the small gap you are seeing:
- Change here so that there is no space.
- We remove the ‘(CNN)’ keyword from the article by adding the following lines here (a rough sketch of this kind of change is given below).
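The exact lines linked above are not shown here, so this is only a guess at the kind of change being described: stripping a leading ‘(CNN)’-style byline from each article before it is written out. The function name and regex are hypothetical:

```python
import re

# Hypothetical helper: drop a leading "(CNN)" (or "(CNN) -- ") byline from an article.
# The exact pattern used in the official preprocessing may differ.
CNN_PREFIX = re.compile(r'^\(CNN\)(\s*--)?\s*')

def strip_cnn_prefix(article: str) -> str:
    return CNN_PREFIX.sub('', article)

assert strip_cnn_prefix('(CNN) -- The quick brown fox.') == 'The quick brown fox.'
assert strip_cnn_prefix('A story without a byline.') == 'A story without a byline.'
```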