Pre-training parameters
Hi,
I’m currently training a BERT model from scratch using the same parameters as specified in scripts/cheatsheet.txt.
@ibeltagy Could you confirm that these parameters are up to date 🤔
The loss looks fine, but I’m wondering why pre-training both the 128 and the 512 sequence-length models on 3B tokens with a v3-8 TPU is a lot faster than your reported training time.
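For context, here is a minimal sketch of what such a two-phase setup typically looks like with the upstream google-research/bert run_pretraining.py script. The bucket paths, batch sizes, and step counts below are illustrative placeholders, not the exact values from scripts/cheatsheet.txt:

```bash
# Phase 1: train most steps at max_seq_length=128 (values are placeholders)
python run_pretraining.py \
  --input_file=gs://my-bucket/pretrain_tfrecords_128/*.tfrecord \
  --output_dir=gs://my-bucket/model_128 \
  --do_train=True \
  --bert_config_file=bert_config.json \
  --train_batch_size=256 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=900000 \
  --num_warmup_steps=10000 \
  --learning_rate=1e-4 \
  --use_tpu=True \
  --tpu_name=my-tpu

# Phase 2: continue from the phase-1 checkpoint at max_seq_length=512
python run_pretraining.py \
  --input_file=gs://my-bucket/pretrain_tfrecords_512/*.tfrecord \
  --output_dir=gs://my-bucket/model_512 \
  --init_checkpoint=gs://my-bucket/model_128/model.ckpt-900000 \
  --do_train=True \
  --bert_config_file=bert_config.json \
  --train_batch_size=64 \
  --max_seq_length=512 \
  --max_predictions_per_seq=77 \
  --num_train_steps=100000 \
  --num_warmup_steps=1000 \
  --learning_rate=2e-5 \
  --use_tpu=True \
  --tpu_name=my-tpu
```

The two-phase split (short sequences for most steps, long sequences for the remainder) is the usual reason 128 and 512 runs have very different wall-clock costs, so differences in the phase lengths alone can change total training time substantially.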
Issue Analytics
- State:
- Created: 4 years ago
- Comments: 5
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks Iz ❤️ Just a last question on that pre-training topic: what was the number of tfrecords (and their corresponding text size per shard) 🤔
250 tfrecords, each file is 800-900MB (around 4000 papers)
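For scale, a quick back-of-envelope check of those shard numbers (my own arithmetic, not figures stated in the thread):

```bash
# Rough corpus size implied by the shard counts above (approximate).
echo "papers:    $((250 * 4000))"   # 250 shards x ~4000 papers  -> ~1,000,000 papers
echo "size (MB): $((250 * 850))"    # 250 shards x ~850 MB/shard -> ~212,500 MB (~212 GB)
```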