Perplexity number of wikitext-103 on gpt-2 doesn't match the paper
Hi,

The reported perplexity of GPT-2 (117M) on wikitext-103 is 37.5. However, when I tokenize wikitext-103 with the pre-trained GPT-2 tokenizer,

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

and then evaluate with the pre-trained 117M GPT-2 model, I get a perplexity of 48.4.

Note: I have added newlines instead of EOS tags at the end of each line read. I have also normalized the loss by the number of tokens originally in wikitext-103, as mentioned by Alec Radford at https://github.com/openai/gpt-2/issues/78.

Could you please let me know what's wrong here?
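Roughly, my evaluation looks like the sketch below. This is a minimal reconstruction rather than my exact script: the file name wiki.test.tokens, the non-overlapping 1024-token chunking, and the use of GPT2LMHeadModel are simplifying assumptions on top of what I described above.

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

with open("wiki.test.tokens", encoding="utf-8") as f:
    lines = f.readlines()

# WikiText-103 is whitespace-tokenized; the paper-style perplexity normalizes by
# this original token count rather than by the number of GPT-2 BPE pieces.
n_original_tokens = sum(len(line.split()) for line in lines)

text = "".join(lines)  # each line keeps its trailing "\n" (newline instead of EOS)
ids = tokenizer.encode(text)

# Score the corpus in non-overlapping windows of the model's 1024-token context.
max_len = 1024
total_nll = 0.0
with torch.no_grad():
    for i in range(0, len(ids), max_len):
        chunk = torch.tensor([ids[i : i + max_len]], device=device)
        if chunk.size(1) < 2:
            continue
        out = model(chunk, labels=chunk)  # labels are shifted inside the model
        # out.loss is the mean NLL per predicted BPE token; convert to a sum.
        total_nll += out.loss.item() * (chunk.size(1) - 1)

ppl = math.exp(total_nll / n_original_tokens)
print(f"Perplexity normalized by the original token count: {ppl:.2f}")
```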
This discussion is old, so this may not be applicable anymore, but I'd like to offer a data point in case it is. I tried gpt2 and gpt2-medium on OpenWebText (tokenized with Hugging Face's corresponding tokenizer settings) and got perplexities of about 24 and 18, respectively, whereas the OpenAI versions give about 17 and 13. This is good enough to say that I probably didn't make any catastrophic mistake, but there is still some gap, which may or may not explain the performance gap on other datasets.
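One thing I have not ruled out is how the context window is filled during evaluation: scoring with a sliding window, so that most tokens are predicted with a long left context, usually gives a lower perplexity than scoring independent chunks. Below is a rough sketch of such a strided evaluation; the stride of 512 and the eval.txt file are illustrative choices, not what I actually ran.

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

text = open("eval.txt", encoding="utf-8").read()  # hypothetical evaluation file
ids = torch.tensor([tokenizer.encode(text)], device=device)

max_len, stride = 1024, 512  # window size and how far the window advances each step
nll_sum, n_scored, prev_end = 0.0, 0, 0
with torch.no_grad():
    for begin in range(0, ids.size(1), stride):
        end = min(begin + max_len, ids.size(1))
        trg_len = end - prev_end              # tokens not scored by earlier windows
        input_ids = ids[:, begin:end]
        labels = input_ids.clone()
        labels[:, :-trg_len] = -100           # ignore positions used only as context
        out = model(input_ids, labels=labels)
        # The internal label shift means the very first token is never predicted.
        scored = trg_len - 1 if begin == 0 else trg_len
        nll_sum += out.loss.item() * scored   # mean NLL over scored positions -> sum
        n_scored += scored
        prev_end = end
        if end == ids.size(1):
            break

print(f"Perplexity over BPE tokens: {math.exp(nll_sum / n_scored):.2f}")
```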
Is it possible to know why 0.7 and 0.3? --> 0.7 * exp_average_loss + 0.3 * loss.item()
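For context, that line is just an exponential moving average of the per-step loss: 0.7 weights the previous running average and 0.3 the newest value, so it only smooths the loss that gets reported as evaluation proceeds. A minimal sketch of what it computes, with made-up loss values:

```python
# Exponential moving average of the per-step loss, as in the line quoted above.
# The losses below are hypothetical values purely for illustration.
exp_average_loss = None
for loss_value in [3.9, 3.7, 3.6, 3.8]:
    if exp_average_loss is None:
        exp_average_loss = loss_value  # initialize with the first observed loss
    else:
        # Keep 70% of the running average, mix in 30% of the newest loss.
        exp_average_loss = 0.7 * exp_average_loss + 0.3 * loss_value
    print(f"smoothed loss so far: {exp_average_loss:.4f}")
```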