Perplexity number of wikitext-103 on gpt-2 doesn't match the paper
Hi,

The reported perplexity of GPT-2 (117M) on wikitext-103 is 37.5. However, when I tokenize wikitext-103 with the pre-trained GPT-2 tokenizer,

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

and then evaluate with the pre-trained 117M GPT-2 model, I get a perplexity of 48.4.

Note: I have added newlines instead of EOS tags at the end of each line read. I have also normalized the loss by the number of tokens originally in wikitext-103, as mentioned by Alec Radford at https://github.com/openai/gpt-2/issues/78.

Could you please let me know what's wrong here?
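Roughly, my evaluation looks like the sketch below. This is a minimal reconstruction rather than my exact script: the file name wiki.test.tokens, the non-overlapping 1024-token chunking, and the use of GPT2LMHeadModel are simplifying assumptions on top of what I described above.

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

with open("wiki.test.tokens", encoding="utf-8") as f:
    lines = f.readlines()

# WikiText-103 is whitespace-tokenized; the paper-style perplexity normalizes by
# this original token count rather than by the number of GPT-2 BPE pieces.
n_original_tokens = sum(len(line.split()) for line in lines)

text = "".join(lines)  # each line keeps its trailing "\n" (newline instead of EOS)
ids = tokenizer.encode(text)

# Score the corpus in non-overlapping windows of the model's 1024-token context.
max_len = 1024
total_nll = 0.0
with torch.no_grad():
    for i in range(0, len(ids), max_len):
        chunk = torch.tensor([ids[i : i + max_len]], device=device)
        if chunk.size(1) < 2:
            continue
        out = model(chunk, labels=chunk)  # labels are shifted inside the model
        # out.loss is the mean NLL per predicted BPE token; convert to a sum.
        total_nll += out.loss.item() * (chunk.size(1) - 1)

ppl = math.exp(total_nll / n_original_tokens)
print(f"Perplexity normalized by the original token count: {ppl:.2f}")
```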
This discussion is old, so this may not be applicable anymore, but I'd like to offer a data point in case it is. I tried gpt2 and gpt2-medium on OpenWebText (tokenized with Hugging Face's corresponding tokenizer settings) and got perplexities of about 24 and 18, respectively, whereas the OpenAI versions give about 17 and 13. This is good enough to say that I probably didn't make any catastrophic mistake, but there is still some gap, which may or may not explain the performance gap on other datasets.
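One thing I have not ruled out is how the context window is filled during evaluation: scoring with a sliding window, so that most tokens are predicted with a long left context, usually gives a lower perplexity than scoring independent chunks. Below is a rough sketch of such a strided evaluation; the stride of 512 and the eval.txt file are illustrative choices, not what I actually ran.

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

text = open("eval.txt", encoding="utf-8").read()  # hypothetical evaluation file
ids = torch.tensor([tokenizer.encode(text)], device=device)

max_len, stride = 1024, 512  # window size and how far the window advances each step
nll_sum, n_scored, prev_end = 0.0, 0, 0
with torch.no_grad():
    for begin in range(0, ids.size(1), stride):
        end = min(begin + max_len, ids.size(1))
        trg_len = end - prev_end              # tokens not scored by earlier windows
        input_ids = ids[:, begin:end]
        labels = input_ids.clone()
        labels[:, :-trg_len] = -100           # ignore positions used only as context
        out = model(input_ids, labels=labels)
        # The internal label shift means the very first token is never predicted.
        scored = trg_len - 1 if begin == 0 else trg_len
        nll_sum += out.loss.item() * scored   # mean NLL over scored positions -> sum
        n_scored += scored
        prev_end = end
        if end == ids.size(1):
            break

print(f"Perplexity over BPE tokens: {math.exp(nll_sum / n_scored):.2f}")
```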
Is it possible to know why 0.7 and 0.3? --> 0.7 * exp_average_loss + 0.3 * loss.item()
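For context, that line is just an exponential moving average of the per-step loss: 0.7 weights the previous running average and 0.3 the newest value, so it only smooths the loss that gets reported as evaluation proceeds. A minimal sketch of what it computes, with made-up loss values:

```python
# Exponential moving average of the per-step loss, as in the line quoted above.
# The losses below are hypothetical values purely for illustration.
exp_average_loss = None
for loss_value in [3.9, 3.7, 3.6, 3.8]:
    if exp_average_loss is None:
        exp_average_loss = loss_value  # initialize with the first observed loss
    else:
        # Keep 70% of the running average, mix in 30% of the newest loss.
        exp_average_loss = 0.7 * exp_average_loss + 0.3 * loss_value
    print(f"smoothed loss so far: {exp_average_loss:.4f}")
```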