Can you share perplexity during pretraining for some experiments
Hi, first of all, great work, congrats! The experiments in the paper are very detailed and they help answer some very interesting questions.
Could you please share the pretraining perplexity of some of the experiments (similar to Table 6 in the BERT paper)? It would provide more insight into these models.
Specifically, I am looking for:
- Final ppl achieved by the models in Table 14 of your paper, i.e. T5-11B, T5-3B, T5-Large, T5-Base, and T5-Small.
- Final ppl for the experiments in Table 13.
Thanks, I appreciate your time in answering these questions.
We don’t monitor the validation perplexity. C4 is so big that none of our models could possibly overfit to it. For all of the experiments in the paper except those in Table 14 we don’t even come close to making a single pass over the dataset; for the experiments in Table 14 we pre-train for 1T tokens which does end up being a single pass-ish depending on the mixing rate. But, overall, there’s no substantive difference between train and validation perplexity for any of the experiments (except those in Figure 6/Table 9, where we artificially limit the size of C4, but that’s the only case where we actually show the training loss).
For Table 14, the train perplexities are not comparable across models because they use different mixing rates (corresponding to a different artificial dataset size for each model). As a result some models see more or less unlabeled data, and as I mentioned above the perplexity for supervised tasks can be tiny due to the effectively limited set of reasonable tokens for classification tasks. I don’t think you would learn anything from looking at those perplexities.
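For reference, examples-proportional mixing (as described in the paper) sets each task's sampling rate to r_m = min(e_m, K) / Σ_n min(e_n, K), where K is the artificial dataset size limit. Here is a rough sketch of how a different K changes the fraction of pre-training drawn from C4; the example counts and K values below are made up for illustration and are not the actual T5 settings:

```python
# Rough sketch of examples-proportional mixing: each task m with e_m examples
# gets sampling rate r_m = min(e_m, K) / sum_n min(e_n, K), where K is an
# artificial dataset size limit. All counts and K values here are hypothetical.

def mixing_rates(example_counts, limit_k):
    """Sampling rate per task under examples-proportional mixing."""
    capped = {task: min(count, limit_k) for task, count in example_counts.items()}
    total = sum(capped.values())
    return {task: c / total for task, c in capped.items()}

# Hypothetical example counts (not the real T5 numbers).
counts = {
    "c4_unsupervised": 10**9,  # effectively unlimited unlabeled data
    "glue_mnli": 400_000,
    "squad": 90_000,
}

for k in (2**19, 2**21):  # two hypothetical artificial size limits
    rates = mixing_rates(counts, k)
    print(f"K={k}: fraction of examples drawn from C4 = {rates['c4_unsupervised']:.3f}")
```

With a larger K, a larger fraction of pre-training examples comes from C4, which is why the train perplexities on the unsupervised objective are not directly comparable across models.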
For Table 13, here is a plot of the training losses (proportional to perplexity) for the variants in the table (not including ensembles, since ensembling is done post-hoc):
As a key:
- sc-bi_v1 is the baseline trained 4x as long,
- sc-bi_v1-bsx4 is the baseline trained with a 4x bigger batch size,
- sc-bi_v1-2x is the 2x larger model trained 2x as long, and
- sc-bi_v1-4x is the 4x bigger model trained for the same number of steps as the baseline.
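If you want to read those curves as perplexities, and assuming the plotted loss is the mean per-token cross-entropy in nats (an assumption, not something stated in this thread), the conversion is just exp(loss):

```python
import math

def perplexity_from_loss(mean_token_cross_entropy_nats):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_token_cross_entropy_nats)

# Hypothetical loss value, not read off the actual plot.
print(perplexity_from_loss(1.5))  # ~4.48
```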
Thanks for your interest!