[trainer] `--load_best_model_at_end` silently turns off `--save_steps` settings

Splitting off from https://github.com/huggingface/transformers/pull/12477#discussion_r668326212

Currently --load_best_model_at_end silently turns off the --save_steps setting when --do_eval is off, that is, when --evaluation_strategy is "no" (setting it to anything other than "no" automatically turns on --do_eval).

The proposal is to assert if:

--load_best_model_at_end is set and --evaluation_strategy is "no"
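The proposed check could look roughly like the sketch below. It is only an illustration: `SketchArgs` is a hypothetical stand-in with three fields, not the real `TrainingArguments`, and the error message wording is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SketchArgs:
    """Hypothetical stand-in for the trainer arguments (not the real TrainingArguments)."""

    load_best_model_at_end: bool = False
    evaluation_strategy: str = "no"
    save_steps: int = 500

    def __post_init__(self):
        # Without evaluation there is no metric to pick a "best" checkpoint by,
        # so fail loudly instead of silently ignoring --save_steps.
        if self.load_best_model_at_end and self.evaluation_strategy == "no":
            raise ValueError(
                "--load_best_model_at_end requires --evaluation_strategy "
                "to be set to something other than 'no'"
            )
```

With this in place, `SketchArgs(load_best_model_at_end=True)` raises a `ValueError`, while `SketchArgs(load_best_model_at_end=True, evaluation_strategy="steps")` is accepted.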

Reproducible test:

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06  --do_train --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 500 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --predict_with_generate --sortish_sampler --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" --source_prefix "translate English to Romanian: " --warmup_steps 50 --max_train_samples 50 --save_steps 1

This saves checkpoints.

Adding --load_best_model_at_end to the same command then stops saving them.

@sgugger.

Top GitHub Comments

sgugger commented, Jul 13, 2021

Yes, as said in that comment, I think it’s reasonable if we raise an error if --load_best_model_at_end is set and --evaluation_strategy is “no” since there is no “best model” to pick from in that case. I can do it later today if you want.

sgugger commented, Aug 30, 2021

Yes, this was eventually fixed by #12786.