
Global steps smaller than total training set size after training finishes

See original GitHub issue

Hey,

First of all, thanks for your great library; it's been a huge help! I used your architecture to fine-tune a pretrained BERT model from the Hugging Face library on a smaller dataset for binary text classification. From training_progress_scores.csv I noticed that, no matter what epoch number I choose, the global steps within the epochs do not even get close to the size of my training dataset. Does that mean the model doesn't even look at all the training examples? I wonder if I am confusing something here.

Here is an example: I am training on ~4790 samples/training examples with the following hyperparameters and settings:

{"adam_epsilon": 1e-08, "do_lower_case": true, "use_early_stopping": false, "early_stopping_delta": 0.01, "early_stopping_metric": "acc", "early_stopping_metric_minimize": false, "early_stopping_patience": 5, "encoding": "utf-8", "eval_batch_size": 8, "evaluate_during_training": true, "evaluate_during_training_steps": 500, "evaluate_during_training_verbose": true, "fp16": false, "gradient_accumulation_steps": 1, "learning_rate": 2e-5, "logging_steps": 500, "manual_seed": 17, "max_grad_norm": 1.0, "max_seq_length": 128, "num_train_epochs": 3, "n_gpu": 1, "overwrite_output_dir": true, "reprocess_input_data": false, "save_eval_checkpoints": false, "save_model_every_epoch": false, "save_steps": 2000, "train_batch_size": 8, "use_cached_eval_features": false, "use_multiprocessing": true, "warmup_ratio": 0.10, "weight_decay": 0}

Nevertheless, the last documented global step in my progress file is 1521. Did I make a mistake or misunderstand something? Happy for any feedback, thanks!

My training_progress_scores.csv:

global_step,tp,tn,fp,fn,mcc,f1,precision,recall,f1_weighted,precision_weighted,recall_weighted,train_loss,eval_loss,acc
500,249,130,52,19,0.670966975526973,0.875219683655536,0.8272425249169435,0.9291044776119403,0.838932445100472,0.8455398733032571,0.8422222222222222,0.11351273953914642,0.3626793656955686,0.8422222222222222
507,249,129,53,19,0.666374588262315,0.8736842105263157,0.8245033112582781,0.9291044776119403,0.8365295055821371,0.8435600501163415,0.84,0.9388631582260132,0.3636407738453464,0.84
1000,215,161,21,53,0.6750015267273123,0.8531746031746031,0.9110169491525424,0.8022388059701493,0.836979316979317,0.846839502261647,0.8355555555555556,0.03929607570171356,0.4004484978422784,0.8355555555555556
1014,243,140,42,25,0.6884167901651631,0.8788426763110306,0.8526315789473684,0.9067164179104478,0.849752504170481,0.8509544568491937,0.8511111111111112,0.0117262601852417,0.40816347301006317,0.8511111111111112
1500,241,145,37,27,0.7029086608187551,0.8827838827838829,0.8669064748201439,0.8992537313432836,0.8570713906307128,0.8572470395776403,0.8577777777777778,0.003922566771507263,0.5339946605657276,0.8577777777777778
1521,241,147,35,27,0.7124595156999709,0.8860294117647057,0.8731884057971014,0.8992537313432836,0.8616872291987956,0.861718029873952,0.8622222222222222,0.010883927345275879,0.5320389102164068,0.8622222222222222
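
As a rough sanity check of the arithmetic (a sketch, not the library's internal code): one global step is one gradient update on a whole batch of train_batch_size examples, not on a single example, so the final step count is expected to be much smaller than the number of training examples. Assuming all ~4790 examples reach the training loop:

import math

# Values taken from the args and description above
num_train_examples = 4790
train_batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 3

# One global step consumes train_batch_size * gradient_accumulation_steps examples
steps_per_epoch = math.ceil(num_train_examples / (train_batch_size * gradient_accumulation_steps))
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch)  # 599
print(total_steps)      # 1797

The logged final step of 1521 is exactly 3 * 507, i.e. about 507 batches of 8 per epoch, which would correspond to an effective train split of roughly 4,050 examples (for example after holding out an eval set) rather than the full ~4,790; either way, every example in the split is processed once per epoch.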

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
ThilinaRajapakse commented, Jul 15, 2020

I think what you are looking for is evaluation during training. There, you periodically evaluate the model on eval/validation data while training on the train data. Your evaluation loss will generally decrease with training until the model starts overfitting, at which point the eval loss will start increasing. It’s generally a good idea to stop training when the evaluation loss stops improving.

Look for the evaluate_during_training_* configuration options in the docs. You might also want to look into early_stopping to automatically end training when evaluation loss stops improving.
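
As a rough illustration of how those options fit together in simpletransformers (the argument names are taken from the config posted in the question; the model name, the tiny placeholder DataFrames, and the switch to eval_loss as the early-stopping metric are assumptions for the sketch, not from the issue):

import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder data; in the issue this would be the ~4790-example train split
# and a held-out eval split ("text" and "labels" columns).
train_df = pd.DataFrame({"text": ["good", "bad", "fine", "awful"], "labels": [1, 0, 1, 0]})
eval_df = pd.DataFrame({"text": ["nice", "terrible"], "labels": [1, 0]})

model_args = {
    "evaluate_during_training": True,        # run eval on eval_df while training
    "evaluate_during_training_steps": 500,   # every 500 global steps
    "use_early_stopping": True,              # stop once the metric stops improving
    "early_stopping_metric": "eval_loss",
    "early_stopping_metric_minimize": True,
    "early_stopping_patience": 5,
    "early_stopping_delta": 0.01,
    "num_train_epochs": 3,
    "train_batch_size": 8,
}

model = ClassificationModel("bert", "bert-base-uncased", args=model_args, use_cuda=False)
model.train_model(train_df, eval_df=eval_df)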

1 reaction
gitfabianmeyer commented, Jul 15, 2020

Honestly, this is more of a question for Stack Overflow or other helpful sites on the web. More data is always better for generalization. Normally, you iterate over your whole dataset multiple times (epochs) to let your model converge. What you describe sounds more like a good way to end up in a local minimum (optimum) than a real solution to the problem.
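
To make the epochs-versus-steps point concrete, here is a small PyTorch-style sketch (illustrative only, not the library's internals): every example is visited once per epoch, while the global step counter only advances once per batch.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(4790))       # stand-in for ~4790 training examples
loader = DataLoader(dataset, batch_size=8, shuffle=True)

global_step = 0
examples_seen = 0
for epoch in range(3):
    for (batch,) in loader:
        examples_seen += batch.shape[0]  # 8 examples per step (the last batch may be smaller)
        global_step += 1                 # one gradient update would happen here

print(global_step)    # 1797 = 599 batches/epoch * 3 epochs
print(examples_seen)  # 14370 = 4790 examples * 3 epochs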

Read more comments on GitHub >

Top Results From Across the Web

Why training set should always be smaller than test set
The reason why the training dataset is always chosen to be larger than the test one is that somebody says that the larger the data used...
Read more >
What is the difference between steps and epochs in ...
A training step is one gradient update. In one step batch_size examples are processed. An epoch consists of one full cycle through the...
Read more >
Effect of batch size on training dynamics | by Kevin Shen
This is intuitively explained by the fact that smaller batch sizes allow the model to “start learning before having to see all the...
Read more >
What is the trade-off between batch size and number of ...
Since batch_size only divides the training data set into batches, would it make sense to rearrange the dataset (non temporal) to have uniform ......
Read more >
How to Control the Stability of Training Neural Networks With ...
Historically, a training algorithm where the batch size is set to the total number of training examples is called “batch gradient descent” and...
Read more >
