Model is saved every eval_steps steps if eval_steps < save_steps. Is this expected behavior?
Environment info
- transformers version: 4.6.1
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.5
- PyTorch version (GPU?): 1.7.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Information
Model I am using (Bert, XLNet …): Bert, but I don’t think that is relevant
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
- Make a TrainingArguments object with eval_steps < save_steps and with evaluation_strategy and save_strategy both set to "steps"
- Pass those to a Trainer
- The model checkpoints every eval_steps steps, not every save_steps steps
Here is my TrainingArguments code:
args = TrainingArguments(
    output_dir=outpath,
    save_total_limit=10,
    load_best_model_at_end=True,
    save_strategy="steps" if cli_args.save_steps is not None else "epoch",
    save_steps=cli_args.save_steps,
    evaluation_strategy="steps" if cli_args.eval_steps is not None else "epoch",
    eval_steps=cli_args.eval_steps,
    metric_for_best_model="loss",
    learning_rate=cli_args.learning_rate,
    per_device_train_batch_size=cli_args.batch_size,
    per_device_eval_batch_size=cli_args.batch_size,
    num_train_epochs=cli_args.num_train_epochs,
    weight_decay=cli_args.weight_decay,
    fp16=cli_args.fp16,
    deepspeed=deepspeed,
    local_rank=cli_args.local_rank,
)
With the values I am using filled in, this is:
args = TrainingArguments(
    output_dir="ten_m/model",
    save_total_limit=10,
    load_best_model_at_end=True,
    save_strategy="steps",
    save_steps=6,  # for testing
    evaluation_strategy="steps",
    eval_steps=2,  # for testing
    metric_for_best_model="loss",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=False,
    deepspeed=None,
    local_rank=-1,
)
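For completeness, these args are then passed to a Trainer in the usual way. A minimal sketch of the wiring (model, train_dataset, and eval_dataset here are placeholders for my actual BERT model and datasets; the behavior does not depend on them):

from transformers import Trainer

trainer = Trainer(
    model=model,                  # placeholder: my BERT model
    args=args,                    # the TrainingArguments shown above
    train_dataset=train_dataset,  # placeholder dataset
    eval_dataset=eval_dataset,    # placeholder dataset
)
trainer.train()  # checkpoints end up being written every 2 steps, not every 6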
Expected behavior
Well, maybe this is expected? But if so, I feel like it should be documented more obviously.
I wrote a callback that uploads each saved checkpoint to GCS (roughly sketched below). Since evaluation is very quick, I was planning to evaluate much more frequently than I save; but if every evaluation also triggers an upload to GCS, I will have to evaluate less often. Note that this is not caused by my callback: I verified that even without the GCS save callback, with the settings above a checkpoint is saved every 2 steps, not every 6.
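Roughly, that upload callback hooks into on_save and pushes the newest checkpoint directory; the bucket name and the upload_dir_to_gcs helper are placeholders for my own code, not anything from transformers:

import os
from transformers import TrainerCallback

class GCSUploadCallback(TrainerCallback):
    def on_save(self, args, state, control, **kwargs):
        # The Trainer has just written <output_dir>/checkpoint-<global_step>.
        ckpt_dir = os.path.join(args.output_dir, f"checkpoint-{state.global_step}")
        upload_dir_to_gcs(ckpt_dir, bucket="my-bucket")  # placeholder upload helper
        return control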
If this is expected behavior, then is the correct way to change it to write a Callback whose on_evaluate method sets the should_save property of its transformers.TrainerControl argument to False?
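In other words, something like the sketch below, passed via Trainer(..., callbacks=[NoSaveOnEvalCallback()]), though I am not sure whether flipping the flag in on_evaluate is actually honored before the Trainer decides whether to save:

from transformers import TrainerCallback

class NoSaveOnEvalCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, **kwargs):
        # Tell the Trainer not to save a checkpoint after this evaluation.
        control.should_save = False
        return control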
Thank you
Top GitHub Comments
- sure!
- Sure! Do you want to make a PR with that change?