Reproducibility issue
I’m having trouble reproducing the results on the CNN/DM dataset.
I downloaded the data and the fine-tuned model provided in the README, and I followed the commands to predict the test set.
Everything runs fine, but at the end I get the following results:
1 ROUGE-1 Average_R: 0.62689 (95%-conf.int. 0.62269 - 0.63111)
1 ROUGE-1 Average_P: 0.13695 (95%-conf.int. 0.13561 - 0.13828)
1 ROUGE-1 Average_F: 0.22101 (95%-conf.int. 0.21918 - 0.22288)
1 ROUGE-2 Average_R: 0.33142 (95%-conf.int. 0.32673 - 0.33603)
1 ROUGE-2 Average_P: 0.06949 (95%-conf.int. 0.06832 - 0.07078)
1 ROUGE-2 Average_F: 0.11266 (95%-conf.int. 0.11089 - 0.11456)
1 ROUGE-L Average_R: 0.52624 (95%-conf.int. 0.52179 - 0.53061)
1 ROUGE-L Average_P: 0.11465 (95%-conf.int. 0.11345 - 0.11598)
1 ROUGE-L Average_F: 0.18509 (95%-conf.int. 0.18333 - 0.18698)
/root/code/unilm/src/cnndm_model/cnndm_model.bin.test.alp1.0
ROUGE-F(1/2/l): 22.10/11.27/18.51
ROUGE-R(1/2/3/l): 62.69/33.14/52.62
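As a rough sanity check on these numbers (a sketch, not part of the original evaluation scripts), the ROUGE-1 precision and recall can be compared directly. The snippet assumes the F-score is the balanced harmonic mean of precision and recall. The helper `f1` and the variable names are made up for illustration. Because ROUGE averages per-sample scores, the F of the averaged P and R is only expected to land near the reported Average_F, not to match it exactly.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced harmonic mean of precision and recall (assumed F-score form)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# ROUGE-1 averages copied from the output above.
rouge1_p = 0.13695
rouge1_r = 0.62689
reported_f = 0.22101

# F computed from the averaged P and R; only approximately comparable
# to the reported value, which is an average of per-sample F-scores.
f_of_averages = f1(rouge1_p, rouge1_r)

# For unigram overlap, recall / precision is roughly
# (candidate length) / (reference length).
length_ratio = rouge1_r / rouge1_p

print(f"F from averaged P/R: {f_of_averages:.4f} (reported: {reported_f:.4f})")
print(f"recall / precision:  {length_ratio:.2f}")
```

A recall several times higher than the precision suggests that the scored candidates contain far more unigrams than the references they were compared against. That points at how the files are being scored rather than at the predictions themselves.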
It’s weird because I checked the prediction file (cnndm_model.bin.test.alp1.0.post) and compared it with the one provided in the README, and most of the time there are only a few differences.
Here is a comparison of the last few lines of the file (left is the ‘official’ one, right is mine)
Issue Analytics
- Created 4 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
Thanks for spotting the incorrect script path.
-li
Commit 8c1f46d4e4ab7993665ac2a76406855c471a15df fixed my problem: I was using the wrong script. Using cnndm/eval.py instead of gigaword/eval.py fixed it.
I could reproduce your result using the official ROUGE script.
Thanks a lot for the help!! 👍