Can you share perplexity during pretraining for some experiments
Hi, first of all, great work, congrats! The experiments in the paper are very detailed and they help answer some very interesting questions.
Could you please share the pretraining perplexity of some of the experiments (similar to Table 6 in the BERT paper)? It would provide more insight into these models.
Specifically, I am looking for:
- Final ppl achieved by the models in Table 14 of your paper, i.e. T5-11B, T5-3B, T5-Large, T5-Base, and T5-Small.
- Final ppl for the experiments in Table 13.
Thanks, I appreciate your time in answering these questions.
We don’t monitor the validation perplexity. C4 is so big that none of our models could possibly overfit to it. For all of the experiments in the paper except those in Table 14 we don’t even come close to making a single pass over the dataset; for the experiments in Table 14 we pre-train for 1T tokens which does end up being a single pass-ish depending on the mixing rate. But, overall, there’s no substantive difference between train and validation perplexity for any of the experiments (except those in Figure 6/Table 9, where we artificially limit the size of C4, but that’s the only case where we actually show the training loss).
For Table 14, the train perplexities are not comparable across models because they use different mixing rates (corresponding to a different artificial dataset size for each model). As a result some models see more or less unlabeled data, and as I mentioned above the perplexity for supervised tasks can be tiny due to the effectively limited set of reasonable tokens for classification tasks. I don’t think you would learn anything from looking at those perplexities.
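For reference, examples-proportional mixing (as described in the paper) sets each task's sampling rate to r_m = min(e_m, K) / Σ_n min(e_n, K), where K is the artificial dataset size limit. Here is a rough sketch of how a different K changes the fraction of pre-training drawn from C4; the example counts and K values below are made up for illustration and are not the actual T5 settings:

```python
# Rough sketch of examples-proportional mixing: each task m with e_m examples
# gets sampling rate r_m = min(e_m, K) / sum_n min(e_n, K), where K is an
# artificial dataset size limit. All counts and K values here are hypothetical.

def mixing_rates(example_counts, limit_k):
    """Sampling rate per task under examples-proportional mixing."""
    capped = {task: min(count, limit_k) for task, count in example_counts.items()}
    total = sum(capped.values())
    return {task: c / total for task, c in capped.items()}

# Hypothetical example counts (not the real T5 numbers).
counts = {
    "c4_unsupervised": 10**9,  # effectively unlimited unlabeled data
    "glue_mnli": 400_000,
    "squad": 90_000,
}

for k in (2**19, 2**21):  # two hypothetical artificial size limits
    rates = mixing_rates(counts, k)
    print(f"K={k}: fraction of examples drawn from C4 = {rates['c4_unsupervised']:.3f}")
```

With a larger K, a larger fraction of pre-training examples comes from C4, which is why the train perplexities on the unsupervised objective are not directly comparable across models.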
For Table 13, here is a plot of the training losses (proportional to perplexity) for the variants in the table (not including ensembles, since ensembling is done post-hoc):
As a key:
- sc-bi_v1 is the baseline trained 4x as long,
- sc-bi_v1-bsx4 is the baseline trained with a 4x bigger batch size,
- sc-bi_v1-2x is the 2x larger model trained 2x as long, and
- sc-bi_v1-4x is the 4x bigger model trained for the same number of steps as the baseline.
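If you want to read those curves as perplexities, and assuming the plotted loss is the mean per-token cross-entropy in nats (an assumption, not something stated in this thread), the conversion is just exp(loss):

```python
import math

def perplexity_from_loss(mean_token_cross_entropy_nats):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_token_cross_entropy_nats)

# Hypothetical loss value, not read off the actual plot.
print(perplexity_from_loss(1.5))  # ~4.48
```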
Thanks for your interest!