
Can you share perplexity during pretraining for some experiments


Hi, first of all, great work, congrats!!! The experiments in the paper are very detailed and they help answer some very interesting questions.

Can you please share the pretraining perplexity for some of the experiments (similar to Table 6 in the BERT paper)? It would give more insight into these models.

Specifically I am looking for:

  1. The final ppl achieved by the models in Table 14 of your paper, i.e. T5-11B, T5-3B, T5-Large, T5-Base, and T5-Small.

  2. The final ppl for the experiments in Table 13.

Thanks, I appreciate your time in answering these questions.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5

Top GitHub Comments

5 reactions
craffel commented on Oct 31, 2019

We don’t monitor the validation perplexity. C4 is so big that none of our models could possibly overfit to it. For all of the experiments in the paper except those in Table 14 we don’t even come close to making a single pass over the dataset; for the experiments in Table 14 we pre-train for 1T tokens which does end up being a single pass-ish depending on the mixing rate. But, overall, there’s no substantive difference between train and validation perplexity for any of the experiments (except those in Figure 6/Table 9, where we artificially limit the size of C4, but that’s the only case where we actually show the training loss).
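For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The step count and batch size are the paper's baseline settings (2^19 steps at 2^16 tokens per batch), and the ~1T-token figure is the one quoted above; nothing here depends on the exact size of C4.

# Back-of-the-envelope token counts for the two pre-training regimes.
# 2**19 steps at 2**16 tokens per batch follows the paper's baseline settings;
# the ~1T figure for Table 14 is taken from the comment above.
baseline_tokens = 2**19 * 2**16          # ~34B tokens for the Table 13-style runs
table14_tokens = 10**12                  # ~1T tokens for the Table 14 runs

print(f"baseline: ~{baseline_tokens / 1e9:.0f}B tokens")
print(f"Table 14: ~{table14_tokens / 1e12:.0f}T tokens "
      f"(~{table14_tokens / baseline_tokens:.0f}x the baseline)")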

For Table 14, the train perplexities are not comparable across models because they use different mixing rates (corresponding to a different artificial dataset size for each model). As a result some models see more or less unlabeled data, and as I mentioned above the perplexity for supervised tasks can be tiny due to the effectively limited set of reasonable tokens for classification tasks. I don’t think you would learn anything from looking at those perplexities.
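As a rough illustration of why the mixing rate matters, here is a minimal sketch of the examples-proportional mixing scheme described in the paper, where each dataset is sampled in proportion to min(num_examples, K) for an artificial dataset size limit K. The example counts below are made up purely for illustration; a larger K shifts more of the mixture toward the unlabeled data, so models trained with different limits see different amounts of C4.

# Minimal sketch of examples-proportional mixing (T5 paper, Section 3.5.2).
# The dataset example counts below are hypothetical, not the real sizes.

def mixing_rates(example_counts, K):
    """Return each dataset's sampling proportion under an artificial size limit K."""
    capped = {name: min(n, K) for name, n in example_counts.items()}
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}

counts = {"c4_unsupervised": 300_000_000, "glue_mnli": 400_000, "squad": 90_000}

for K in (2**19, 2**21):
    rates = mixing_rates(counts, K)
    print(K, {name: round(r, 3) for name, r in rates.items()})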

For Table 13, here is a plot of the training losses (the log of the perplexity) for the variants in the table (not including ensembles, since ensembling is done post-hoc):

[Image: training loss curves for the Table 13 variants]

As a key, sc-bi_v1 is the baseline trained 4x as long, sc-bi_v1-bsx4 is the baseline trained with a 4x bigger batch size, sc-bi_v1-2x is the 2x larger model trained 2x as long, and sc-bi_v1-4x is the 4x bigger model trained for the same number of steps as the baseline.
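Since the plot shows the training loss rather than the perplexity itself, here is a minimal conversion sketch, assuming the loss is the mean per-token cross-entropy in nats; the loss values below are made up just to show how loss gaps translate into perplexity gaps.

import math

def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy_nats)

# Made-up loss values for illustration only.
for loss in (2.0, 1.9, 1.8):
    print(f"loss {loss:.2f} -> perplexity {perplexity(loss):.2f}")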

1 reaction
craffel commented on Oct 31, 2019

Thanks for your interest!
