
Getting huge number of training steps

See original GitHub issue

I have generated pretraining data using https://github.com/kamalkraj/ALBERT-TF2.0 because it supports training with multiple GPUs. I am doing this for the Hindi language with 22 GB of data. Generating the pretraining data itself took 1 month! I have a meta_data file associated with each tf.record file. I added up the train_data_size values from all the meta_data files to make one meta_data file, because run_pretraining.py requires it. My final meta_data file looks something like this:

{
    "task_type": "albert_pretraining",
    "train_data_size": 596972848,
    "max_seq_length": 512,
    "max_predictions_per_seq": 20
}
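
(For reference, a minimal sketch of how the per-shard meta_data files could be summed into one file. The directory layout and file names here are assumptions, not the exact ones produced by ALBERT-TF2.0.)

import glob
import json

# Hypothetical layout: one *_meta_data JSON next to each tf.record shard.
meta_files = sorted(glob.glob("pretrain_data/*_meta_data"))

merged = None
for path in meta_files:
    with open(path) as f:
        meta = json.load(f)
    if merged is None:
        merged = meta
    else:
        # Only the example count differs across shards; the other fields
        # (task_type, max_seq_length, max_predictions_per_seq) should match.
        merged["train_data_size"] += meta["train_data_size"]

with open("train_meta_data", "w") as f:
    json.dump(merged, f, indent=4)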

Here the number of training steps is calculated as below:

num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs

So total_train_examples is 596972848, hence with a batch size of 64 and only 1 epoch I get num_train_steps = 9327700. I saw that in the readme here num_train_steps=125000. I don't understand what went wrong here.

With such a huge number of training steps, it will take forever to train ALBERT. Even if I increase the batch size to 512 with only 1 epoch, the training steps come to 1165962, which is still huge! Since ALBERT was trained on very large data, why are there only 125000 steps? I want to know how many epochs there were in ALBERT training for English.
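
(As a sanity check, a small sketch of the arithmetic using the formula above; the numbers are the ones from this issue.)

total_train_examples = 596972848

# num_train_steps = int(total_train_examples / train_batch_size) * num_train_epochs
print(int(total_train_examples / 64) * 1)    # 9327700  (batch size 64, 1 epoch)
print(int(total_train_examples / 512) * 1)   # 1165962  (batch size 512, 1 epoch)

# Conversely, a fixed budget of 125000 steps at batch size 512 would cover
# only a fraction of the data once:
print(125000 * 512 / total_train_examples)   # ~0.107 of one epoch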

Can anyone suggest what I should do now?

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments:5 (1 by maintainers)

Top GitHub Comments

1 reaction
illuminascent commented, Mar 6, 2020

@008karan If you haven’t done a full shuffle of your data -> yes. Otherwise, any subset of the training dataset should represent the whole set well enough, and it’s perfectly fine to stop short of a complete epoch. Google did something similar when training T5, because the C4 dataset is too big to cover entirely.
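
(A minimal tf.data sketch of that idea, assuming a shard pattern like pretrain_data/*.tfrecord: shuffle the shards and examples, repeat indefinitely, and let a fixed step budget, not an epoch boundary, decide when training stops.)

import tensorflow as tf

# Hypothetical shard pattern; example parsing is omitted for brevity.
files = tf.data.Dataset.list_files("pretrain_data/*.tfrecord", shuffle=True)

dataset = (files
           .interleave(tf.data.TFRecordDataset,
                       cycle_length=8,
                       num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(buffer_size=10000)  # approximate in-memory shuffle
           .repeat()                    # no epoch boundary; the step count ends training
           .batch(512, drop_remainder=True)
           .prefetch(tf.data.AUTOTUNE))

# The training loop then runs for a fixed number of steps (e.g. 125000),
# regardless of whether that covers the full dataset once.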

0 reactions
008karan commented, Mar 6, 2020

Still, I need to complete at least 1 epoch to pass the whole dataset through the model, don’t I?

Read more comments on GitHub >

Top Results From Across the Web

Choosing number of Steps per Epoch - Stack Overflow
Too large of a batch size can get you stuck in a local minima, so if your training get stuck, I would reduce...
Read more >
All You Need to Know about Batch Size, Epochs and Training ...
Batch size refers to the number of training instances in the batch. Epochs refer to the number of times the model sees the...
Read more >
How to set batch_size, steps_per epoch, and validation steps?
If you have a training set of fixed size you can ignore it but it may be useful if you have a huge...
Read more >
What is batch size, steps, iteration, and epoch in the neural ...
Training a neural network model you usually update a metric of your model using some calculations on the data. When the size of...
Read more >
A Guide to (Highly) Distributed DNN Training | by Chaim Rand
This is especially true if the batch size is much larger, i.e. in the case where we have a large number of workers....
Read more >
