[trainer] `--load_best_model_at_end` silently turns off `--save_steps` settings

Splitting off from https://github.com/huggingface/transformers/pull/12477#discussion_r668326212

Currently, --load_best_model_at_end silently turns off the --save_steps setting when --do_eval is off, that is, when --evaluation_strategy is "no" (setting it to anything other than "no" automatically turns on --do_eval).

The proposal is to assert if:

--load_best_model_at_end is set and --evaluation_strategy is "no"
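
The proposed check can be sketched as follows. This is a minimal illustration, not the actual transformers code: `ArgsSketch` and `validate` are hypothetical names standing in for the real training arguments, and only the two flags named above are modeled.

```python
from dataclasses import dataclass


@dataclass
class ArgsSketch:
    """Hypothetical stand-in for the relevant training arguments."""

    load_best_model_at_end: bool = False
    evaluation_strategy: str = "no"
    save_steps: int = 500


def validate(args: ArgsSketch) -> None:
    """Fail loudly instead of silently ignoring --save_steps."""
    if args.load_best_model_at_end and args.evaluation_strategy == "no":
        # Without evaluation there are no metrics to rank checkpoints by,
        # so there is no "best model" to load at the end.
        raise ValueError(
            "--load_best_model_at_end requires --evaluation_strategy "
            "to be something other than 'no'."
        )
```

With such a check in place, the combination that currently disables saving would stop the run at argument-parsing time rather than after training has finished without checkpoints.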

Reproducible test:

export BS=16; rm -r output_dir; PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 examples/pytorch/translation/run_translation.py --model_name_or_path t5-small --output_dir output_dir --adam_eps 1e-06  --do_train --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 500 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --predict_with_generate --sortish_sampler --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" --source_prefix "translate English to Romanian: " --warmup_steps 50 --max_train_samples 50 --save_steps 1 

This run saves checkpoints.

Adding --load_best_model_at_end to the same command stops them from being saved.
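
To compare the two runs, one could list the checkpoint directories left in the output directory. The helper below is a sketch under the assumption that checkpoints are written as `checkpoint-<step>` subdirectories of `output_dir`; `list_checkpoint_steps` is a hypothetical name, not part of any library.

```python
import re
from pathlib import Path
from typing import List


def list_checkpoint_steps(output_dir: str) -> List[int]:
    """Return the sorted step numbers of checkpoint-<step> subdirectories."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    root = Path(output_dir)
    if not root.is_dir():
        return []
    steps = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        # Only directories whose name is exactly checkpoint-<digits> count.
        if match and child.is_dir():
            steps.append(int(match.group(1)))
    return sorted(steps)


if __name__ == "__main__":
    print(list_checkpoint_steps("output_dir"))
```

An empty list after the second run, against a non-empty one after the first, would confirm the behavior described here.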

@sgugger.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Jul 13, 2021

Yes, as said in that comment, I think it’s reasonable if we raise an error if --load_best_model_at_end is set and --evaluation_strategy is “no” since there is no “best model” to pick from in that case. I can do it later today if you want.

0 reactions
sgugger commented, Aug 30, 2021

Yes, this was fixed by #12786 in the end.
