Global steps smaller than total training set size after training finishes
Hey,
first of all, thanks for your great library, it’s been a huge help!
I used it to fine-tune a pretrained BERT model from the Hugging Face library on a smaller dataset for binary text classification.
From the training_progress_scores.csv I noticed that, no matter what epoch number I choose, the global steps within the epochs do not even get close to the size of my train dataset. Does that mean the model doesn't even look at all the training examples? I wonder if I am confusing something here.
Here is an example: I am training on ~4790 samples/training examples with the following hyperparameters and settings:
{"adam_epsilon": 1e-08, "do_lower_case": true, "use_early_stopping": false, "early_stopping_delta": 0.01, "early_stopping_metric": "acc", "early_stopping_metric_minimize": false, "early_stopping_patience": 5, "encoding": "utf-8", "eval_batch_size": 8, "evaluate_during_training": true, "evaluate_during_training_steps": 500, "evaluate_during_training_verbose": true, "fp16": false, "gradient_accumulation_steps": 1, "learning_rate": 2e-5, "logging_steps": 500, "manual_seed": 17, "max_grad_norm": 1.0, "max_seq_length": 128, "num_train_epochs": 3, "n_gpu": 1, "overwrite_output_dir": true, "reprocess_input_data": false, "save_eval_checkpoints": false, "save_model_every_epoch": false, "save_steps": 2000, "train_batch_size": 8, "use_cached_eval_features": false, "use_multiprocessing": true, "warmup_ratio": 0.10, "weight_decay": 0}
Nevertheless, the last documented global step in my progress file is 1521. Did I make a mistake or misunderstand something? I'm happy for any feedback, thanks!
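For reference, here is a small sketch of the step arithmetic implied by these settings, under the assumption that global_step counts optimizer updates (i.e. batches), not individual examples:

```python
import math

# Assumption: one global step = one optimizer update on one batch,
# not one training example.
num_examples = 4790
train_batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 3

batches_per_epoch = math.ceil(num_examples / train_batch_size)      # 599
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps  # 599
total_steps = steps_per_epoch * num_train_epochs                    # 1797

print(total_steps)  # 1797 -- far below 4790, as observed
```

Under that reading, a final step of 1521 over 3 epochs (507 per epoch, matching the evaluations at steps 507, 1014, and 1521 in the file below) would correspond to roughly 507 × 8 ≈ 4056 examples per epoch, so the model would still see every training example once per epoch; the count is simply in batches, not samples.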
My training_progress_scores.csv:
global_step,tp,tn,fp,fn,mcc,f1,precision,recall,f1_weighted,precision_weighted,recall_weighted,train_loss,eval_loss,acc
500,249,130,52,19,0.670966975526973,0.875219683655536,0.8272425249169435,0.9291044776119403,0.838932445100472,0.8455398733032571,0.8422222222222222,0.11351273953914642,0.3626793656955686,0.8422222222222222
507,249,129,53,19,0.666374588262315,0.8736842105263157,0.8245033112582781,0.9291044776119403,0.8365295055821371,0.8435600501163415,0.84,0.9388631582260132,0.3636407738453464,0.84
1000,215,161,21,53,0.6750015267273123,0.8531746031746031,0.9110169491525424,0.8022388059701493,0.836979316979317,0.846839502261647,0.8355555555555556,0.03929607570171356,0.4004484978422784,0.8355555555555556
1014,243,140,42,25,0.6884167901651631,0.8788426763110306,0.8526315789473684,0.9067164179104478,0.849752504170481,0.8509544568491937,0.8511111111111112,0.0117262601852417,0.40816347301006317,0.8511111111111112
1500,241,145,37,27,0.7029086608187551,0.8827838827838829,0.8669064748201439,0.8992537313432836,0.8570713906307128,0.8572470395776403,0.8577777777777778,0.003922566771507263,0.5339946605657276,0.8577777777777778
1521,241,147,35,27,0.7124595156999709,0.8860294117647057,0.8731884057971014,0.8992537313432836,0.8616872291987956,0.861718029873952,0.8622222222222222,0.010883927345275879,0.5320389102164068,0.8622222222222222
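For what it's worth, a quick way to inspect the trend in this file (plain pandas, assuming it reads as a standard CSV):

```python
import pandas as pd

# train_loss keeps shrinking while eval_loss climbs from ~0.36 to ~0.53,
# the overfitting pattern described in the comments below.
df = pd.read_csv("training_progress_scores.csv")
print(df[["global_step", "train_loss", "eval_loss", "acc"]])
```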
Top GitHub Comments
I think what you are looking for is evaluation during training. There, you periodically evaluate the model on eval/validation data while training on the train data. Your evaluation loss will generally decrease with training until the model starts overfitting, at which point the eval loss will start increasing. It’s generally a good idea to stop training when the evaluation loss stops improving.
Look for the evaluate_during_training_* configuration options in the docs. You might also want to look into early_stopping to automatically end training when evaluation loss stops improving; see the sketch below.
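For illustration, a minimal sketch of such a setup, assuming the library in question is simpletransformers; train_df/eval_df are placeholder DataFrames, and early stopping is switched to eval_loss here, unlike the acc metric in the config above:

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder data with the "text" and "labels" columns the library expects.
train_df = pd.DataFrame([["example text", 1], ["another example", 0]], columns=["text", "labels"])
eval_df = pd.DataFrame([["held-out text", 1]], columns=["text", "labels"])

model_args = {
    "evaluate_during_training": True,        # evaluate on eval_df while training
    "evaluate_during_training_steps": 500,   # every 500 global steps
    "use_early_stopping": True,              # stop once eval_loss stops improving
    "early_stopping_metric": "eval_loss",
    "early_stopping_metric_minimize": True,
    "early_stopping_patience": 5,
    "early_stopping_delta": 0.01,
}

model = ClassificationModel("bert", "bert-base-uncased", args=model_args)
model.train_model(train_df, eval_df=eval_df)
```

With evaluate_during_training enabled, the training_progress_scores.csv written during training contains one row per evaluation, as in the file above.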
Honestly, this is more a question for Stack Overflow or other helpful sites on the web. More data generally helps a model generalize better. Normally, you iterate over your whole dataset multiple times (epochs) to let your model converge. What you describe sounds much more like a good way to step into a local minimum (optimum) than a way to really solve the problem.