Difficulties reproducing CNN/DM results with BART
Thanks for the great effort toward releasing BART 😃
I’m currently having some difficulties reproducing BART results on the CNN/DM dataset.
I followed the README to test the bart.large.cnn model.
I obtained the following results:
1 ROUGE-1 Average_R: 0.50475 (95%-conf.int. 0.50197 - 0.50743)
1 ROUGE-1 Average_P: 0.39349 (95%-conf.int. 0.39102 - 0.39598)
1 ROUGE-1 Average_F: 0.43093 (95%-conf.int. 0.42857 - 0.43327)
1 ROUGE-2 Average_R: 0.23609 (95%-conf.int. 0.23327 - 0.23884)
1 ROUGE-2 Average_P: 0.18497 (95%-conf.int. 0.18268 - 0.18712)
1 ROUGE-2 Average_F: 0.20195 (95%-conf.int. 0.19956 - 0.20425)
1 ROUGE-L Average_R: 0.46458 (95%-conf.int. 0.46175 - 0.46720)
1 ROUGE-L Average_P: 0.36244 (95%-conf.int. 0.35999 - 0.36480)
1 ROUGE-L Average_F: 0.39678 (95%-conf.int. 0.39445 - 0.39898)
These scores are more than 1 ROUGE point lower than the expected results.
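For reference, the generation step I ran follows the README's hub interface; here is a minimal sketch of it (the test.source / test.hypo paths are placeholders for my local files):

```python
import torch

# Load the pre-trained CNN/DM checkpoint through torch.hub, as in the README.
bart = torch.hub.load('pytorch/fairseq', 'bart.large.cnn')
bart.cuda()
bart.eval()
bart.half()

def summarize(batch):
    # Beam-search settings taken from the README for CNN/DM.
    with torch.no_grad():
        return bart.sample(batch, beam=4, lenpen=2.0, max_len_b=140,
                           min_len=55, no_repeat_ngram_size=3)

batch, bsz = [], 32
with open('test.source') as source, open('test.hypo', 'w') as fout:
    for line in source:
        batch.append(line.strip())
        if len(batch) == bsz:
            fout.write('\n'.join(summarize(batch)) + '\n')
            batch = []
    if batch:
        fout.write('\n'.join(summarize(batch)) + '\n')
```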
Any advice on how to reproduce the results is welcome.
I personally think the gap comes from the dataset preprocessing.
I think it is preferable to use the script from abisee's repository so that results stay comparable, but we still need to modify that script to make it work for BART, as mentioned here.
And those modifications are still unclear.
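For what it's worth, here is a rough sketch of the kind of preprocessing I mean, under the assumption that BART expects one cased, untokenized article per line in a .source file and one summary per line in a .target file (the helper names and the joining strategy are my own guesses, not an official recipe):

```python
import os

def story_to_example(path):
    """Split an abisee-style .story file into (article, summary) strings."""
    article, highlights = [], []
    next_is_highlight = False
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line == '@highlight':
                next_is_highlight = True
            elif next_is_highlight:
                highlights.append(line)
                next_is_highlight = False
            else:
                article.append(line)
    # Keep the original casing: the released checkpoints were trained on cased text.
    return ' '.join(article), ' '.join(highlights)

def write_split(story_dir, out_prefix):
    """Write one article per line to <prefix>.source and one summary per line to <prefix>.target."""
    with open(out_prefix + '.source', 'w', encoding='utf-8') as src, \
         open(out_prefix + '.target', 'w', encoding='utf-8') as tgt:
        for name in sorted(os.listdir(story_dir)):
            article, summary = story_to_example(os.path.join(story_dir, name))
            src.write(article + '\n')
            tgt.write(summary + '\n')
```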
Top GitHub Comments
With the changes pointed out by @ngoyal2707, here are my results:
Results reproduced! 😄
@Colanim Thanks for your efforts on reproduction. Yes, we train the models on cased data, so having cased input is important at test time.
Some small, subtle differences that could explain the small gap you are seeing:
- Change here so that there is no space.
- We remove the ‘(CNN)’ keyword from the article by adding the following lines here (a rough sketch of this kind of change is given below).
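The exact lines linked above are not shown here, so this is only a guess at the kind of change being described: stripping a leading ‘(CNN)’-style byline from each article before it is written out. The function name and regex are hypothetical:

```python
import re

# Hypothetical helper: drop a leading "(CNN)" (or "(CNN) -- ") byline from an article.
# The exact pattern used in the official preprocessing may differ.
CNN_PREFIX = re.compile(r'^\(CNN\)(\s*--)?\s*')

def strip_cnn_prefix(article: str) -> str:
    return CNN_PREFIX.sub('', article)

assert strip_cnn_prefix('(CNN) -- The quick brown fox.') == 'The quick brown fox.'
assert strip_cnn_prefix('A story without a byline.') == 'A story without a byline.'
```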