
Perplexity number of wikitext-103 on GPT-2 doesn't match the paper

See original GitHub issue

Hi, the reported perplexity of GPT-2 (117M) on wikitext-103 is 37.5.

However, when I tokenize wikitext-103 with the pre-trained GPT-2 tokenizer (tokenizer = GPT2Tokenizer.from_pretrained('gpt2')) and then evaluate with the pre-trained 117M GPT-2 model, I get a perplexity of 48.4.

Note: I append newlines instead of EOS tokens at the end of each line read, and I normalize the loss by the number of tokens originally in wikitext-103, as mentioned by Alec Radford in https://github.com/openai/gpt-2/issues/78.

Could you please let me know what's wrong here?
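For context, here is a minimal sketch of the kind of evaluation being described (not the reporter's exact script): it uses the transformers and datasets libraries, joins lines with newlines rather than EOS tokens, scores the test set with a sliding window, and normalizes the summed loss by the word-level token count of the original wikitext-103 test set, per openai/gpt-2#78. The stride of 512, the wikitext-103-raw-v1 config name, and the whitespace-based count of original tokens are assumptions.

    import math
    import torch
    from datasets import load_dataset
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

    test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
    text = "\n".join(test["text"])                    # newlines instead of EOS tokens
    encodings = tokenizer(text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)

    # Number of tokens in the original (word-level) test set; assumed here to be
    # the whitespace token count of the raw text, per openai/gpt-2#78.
    n_orig_tokens = sum(len(line.split()) for line in test["text"])

    max_len, stride = model.config.n_positions, 512   # 1024-token context, assumed stride
    nll_sum, prev_end = 0.0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_len, seq_len)
        trg_len = end - prev_end                      # score only tokens not yet scored
        input_ids = encodings.input_ids[:, begin:end].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100               # mask pure-context positions
        with torch.no_grad():
            out = model(input_ids, labels=target_ids)
        nll_sum += out.loss.item() * trg_len          # loss is a mean over scored tokens
        prev_end = end
        if end == seq_len:
            break

    # Perplexity normalized by the original token count rather than the BPE count.
    print(f"ppl: {math.exp(nll_sum / n_orig_tokens):.2f}")

Differences in how the raw text is joined, detokenized, and counted are exactly the kind of thing that can move the final number, which is why the normalization choice matters here.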

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

2 reactions
AranKomat commented, Jun 19, 2020

This discussion is old, so this may not be applicable anymore, but I'd like to offer a data point in case it still is. I tried gpt2 and gpt2-medium on OpenWebText (tokenized with Hugging Face's corresponding tokenizer settings) and got perplexities of about 24 and 18, respectively, whereas the OpenAI versions give about 17 and 13. That is close enough to suggest I didn't make any catastrophic mistake, but there is still a gap, which may or may not explain the performance gap on other datasets.

1 reaction
leejason commented, Jun 25, 2019

@yaroslavvb @8enmann When I run this evaluation script, I get a perplexity of 23.85 on wikitext-2 (not raw), 28.186 on wikitext-103 (not raw), and 29.01 on wikitext-103-raw (and wikitext-2-raw). None of these match the numbers reported in the paper for the 117M GPT-2 model (the one that is available): 29.41 for wikitext-2 and 37.5 for wikitext-103.

Also, is it possible to know why 0.7 and 0.3 in 0.7 * exp_average_loss + 0.3 * loss.item()?
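For what it's worth, that expression looks like an exponential moving average of the per-batch loss, typically used only for progress reporting; the 0.7/0.3 weights are just a smoothing factor, not part of the metric. The corpus perplexity should instead come from exact running totals. A minimal sketch of the distinction (the batches and model_nll names are illustrative, not taken from the script in question):

    import math

    def evaluate(batches, model_nll):
        """Illustrative only: the smoothed loss is for logging; the reported
        perplexity should come from the exact running totals."""
        exp_average_loss = None                    # display-time smoothing of the loss
        total_nll, total_tokens = 0.0, 0

        for batch in batches:
            loss, n_tokens = model_nll(batch)      # mean NLL over the batch, token count
            exp_average_loss = (loss if exp_average_loss is None
                                else 0.7 * exp_average_loss + 0.3 * loss)
            total_nll += loss * n_tokens           # exact accumulation
            total_tokens += n_tokens
            print(f"running (smoothed) ppl: {math.exp(exp_average_loss):.2f}")

        return math.exp(total_nll / total_tokens)  # exact corpus perplexity

If the smoothed value is exponentiated and reported as the final result, it will be biased toward the last batches seen, which could account for part of a mismatch.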

Read more comments on GitHub >

Top Results From Across the Web

[R] Struggling to reproduce perplexity benchmarks of ... - Reddit
Hi everyone - I'm struggling to reproduce some of the perplexity scores reported in GPT-2 (see table 3 here) and other papers...
Read more >
Graphcore/gpt2-medium-wikitext-103 - Hugging Face
Optimum Graphcore is a new open-source library and toolkit that enables developers to access IPU-optimized models certified by Hugging Face. It ...
Read more >
Copy is All You Need | OpenReview
A: We test the perplexity of generated texts under an external large pre-trained language model (gpt2-large). The results (WikiText-103) are shown below. Models ...
Read more >
WikiText-103 Benchmark (Language Modelling)
Rank | Model | Test perplexity | Number of params | Year | Tags
1 | Megatron-LM | 10.81 | 8300M | 2019 | Transformer
2 | GLM-XXLarge (bidirectional) | 11.33 | 10000M | 2021
3 | GLM-XXLarge (unidirectional) | 12.22 | 10000M | ...
Read more >
arXiv:2008.07027v1 [cs.CL] 16 Aug 2020
...perplexity than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora, for a given amount of computation or memory.
Read more >
