
Getting huge number of training steps

See original GitHub issue

I have generated pretraining data using https://github.com/kamalkraj/ALBERT-TF2.0 because it supports training with multiple GPUs. I am doing this for the Hindi language with 22 GB of data. Generating the pretraining data itself took 1 month! I have a meta_data file associated with each tf.record file. I added up the train_data_size values from all the meta_data files to make one meta_data file, because run_pretraining.py requires it. My final meta_data file looks something like this:

{
    "task_type": "albert_pretraining",
    "train_data_size": 596972848,
    "max_seq_length": 512,
    "max_predictions_per_seq": 20
}
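
(For reference, a minimal sketch of how the per-shard meta_data files could be summed into one file. The directory layout and file names here are assumptions, not the exact ones produced by ALBERT-TF2.0.)

import glob
import json

# Hypothetical layout: one *_meta_data JSON next to each tf.record shard.
meta_files = sorted(glob.glob("pretrain_data/*_meta_data"))

merged = None
for path in meta_files:
    with open(path) as f:
        meta = json.load(f)
    if merged is None:
        merged = meta
    else:
        # Only the example count differs across shards; the other fields
        # (task_type, max_seq_length, max_predictions_per_seq) should match.
        merged["train_data_size"] += meta["train_data_size"]

with open("train_meta_data", "w") as f:
    json.dump(merged, f, indent=4)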

Here the number of training steps is calculated as below:

num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs

So total_train_examples is 596972848, hence with a batch size of 64 and only 1 epoch I get num_train_steps = 9327700. I saw that in the readme here num_train_steps=125000. I don't understand what went wrong here.

With such a huge number of training steps, it will take forever to train ALBERT. Even if I increase the batch size to 512 with only 1 epoch, the training steps come to 1165962, which is still huge! Since ALBERT was trained on very large data, why are there only 125000 steps? I want to know how many epochs there were in ALBERT training for English.
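
(As a sanity check, a small sketch of the arithmetic using the formula above; the numbers are the ones from this issue.)

total_train_examples = 596972848

# num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs
print(int(total_train_examples / 64) * 1)    # 9327700  (batch size 64, 1 epoch)
print(int(total_train_examples / 512) * 1)   # 1165962  (batch size 512, 1 epoch)

# Conversely, a fixed budget of 125000 steps at batch size 512 would cover
# only a fraction of the data once:
print(125000 * 512 / total_train_examples)   # ~0.107 of one epoch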

Can anyone suggest what I should do now?

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments:5 (1 by maintainers)

Top GitHub Comments

1 reaction
illuminascent commented, Mar 6, 2020

@008karan If you haven’t done a full shuffle of your data -> yes. Otherwise, any subset of the training dataset should represent the whole set well enough, and it’s perfectly fine to stop short of a complete epoch. Google did something similar when training T5, because the C4 dataset is too big to cover entirely.
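
(A minimal tf.data sketch of that idea, assuming a shard pattern like pretrain_data/*.tfrecord: shuffle the shards and examples, repeat indefinitely, and let a fixed step budget, not an epoch boundary, decide when training stops.)

import tensorflow as tf

# Hypothetical shard pattern; example parsing is omitted for brevity.
files = tf.data.Dataset.list_files("pretrain_data/*.tfrecord", shuffle=True)

dataset = (files
           .interleave(tf.data.TFRecordDataset,
                       cycle_length=8,
                       num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(buffer_size=10000)  # approximate in-memory shuffle
           .repeat()                    # no epoch boundary; the step count ends training
           .batch(512, drop_remainder=True)
           .prefetch(tf.data.AUTOTUNE))

# The training loop then runs for a fixed number of steps (e.g. 125000),
# regardless of whether that covers the full dataset once.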

0 reactions
008karan commented, Mar 6, 2020

Still, I need to complete at least 1 epoch to pass the whole dataset through the model, don’t I?

Read more comments on GitHub >

Top Results From Across the Web

Choosing number of Steps per Epoch - Stack Overflow
Too large of a batch size can get you stuck in a local minima, so if your training get stuck, I would reduce...
Read more >
All You Need to Know about Batch Size, Epochs and Training ...
Batch size refers to the number of training instances in the batch. Epochs refer to the number of times the model sees the...
Read more >
How to set batch_size, steps_per epoch, and validation steps?
If you have a training set of fixed size you can ignore it but it may be useful if you have a huge...
Read more >
What is batch size, steps, iteration, and epoch in the neural ...
Training a neural network model you usually update a metric of your model using some calculations on the data. When the size of...
Read more >
A Guide to (Highly) Distributed DNN Training | by Chaim Rand
This is especially true if the batch size is much larger, i.e. in the case where we have a large number of workers....
Read more >
