Race condition when using --save_total_limit, --load_best_model_at_end and deepspeed zero2+cpu_offload
Environment info
- `transformers` version: 4.5.1
- Platform: Linux-5.4.0-1045-aws-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyTorch version (GPU?): 1.7.1+cu110 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes, AWS p4d.24xlarge
- Using distributed or parallel set-up in script?: yes, deepspeed
Who can help
Library:
Information
Model I am using (Bert, XLNet …): roberta-large
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
I’m fine-tuning using run_mlm.py.
A race condition seems to exist when (see the example launch below):
- you limit the number of checkpoints with `--save_total_limit`
- you enable `--load_best_model_at_end --metric_for_best_model eval_loss`
- you use multi-GPU training with deepspeed zero2 + cpu_offload
- the best model happens to be at the head of the list returned by `Trainer._sorted_checkpoints()`
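For concreteness, a launch along these lines hits all of the conditions above (the deepspeed config name, data files, and step counts are illustrative, not the exact command I used):

```bash
deepspeed run_mlm.py \
  --deepspeed ds_config_zero2_offload.json \
  --model_name_or_path roberta-large \
  --train_file data/train.txt --validation_file data/valid.txt \
  --do_train --do_eval \
  --evaluation_strategy steps --eval_steps 1000 \
  --save_steps 1000 --save_total_limit 2 \
  --load_best_model_at_end --metric_for_best_model eval_loss \
  --output_dir /mnt/experiments/roberta-large-mlm
```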
This corner case happens because the checkpoint being deleted is the most recent one, due to the swapping logic in `Trainer._sorted_checkpoints()` at https://github.com/huggingface/transformers/blob/bf2e0cf70b68e0d46cdf15a4ece1f5c0a03de084/src/transformers/trainer.py#L1818-L1821
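For context, the referenced lines perform roughly the swap below (reconstructed from the linked revision, so treat the exact code as approximate); the toy checkpoint list is hypothetical and only illustrates the ordering problem:

```python
# Toy illustration of the swap in Trainer._sorted_checkpoints().
# checkpoints_sorted is ordered oldest -> newest; _rotate_checkpoints()
# later deletes from the head of this list.
checkpoints_sorted = [
    "out/checkpoint-21000",  # happens to be the best model (lowest eval_loss)
    "out/checkpoint-22000",
    "out/checkpoint-23000",  # most recent; deepspeed may still be writing into it
]
best_model_index = checkpoints_sorted.index("out/checkpoint-21000")  # == 0

# Swap the best checkpoint to the end so rotation never deletes it.
checkpoints_sorted[best_model_index], checkpoints_sorted[-1] = (
    checkpoints_sorted[-1],
    checkpoints_sorted[best_model_index],
)

print(checkpoints_sorted)
# ['out/checkpoint-23000', 'out/checkpoint-22000', 'out/checkpoint-21000']
# The most recent checkpoint now sits at the head and is deleted first.
```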
When (by chance) `best_model_index == 0`, the swap moves the most recent checkpoint to the head of the list. When `Trainer._rotate_checkpoints()` is then called, it deletes from the head of the list and consequently deletes the most recent checkpoint. (Aside: this is arguably a bug in itself – you would never be able to resume training from the most recent checkpoint.) At that point, however, deepspeed has not yet finished writing its own global checkpoint into the same checkpoint directory, so the following error is thrown:
```
[INFO|trainer.py:1648] 2021-04-25 00:08:06,377 >> Saving model checkpoint to /mnt/experiments/roberta-large-mlm/checkpoint-23000
[INFO|configuration_utils.py:329] 2021-04-25 00:08:06,378 >> Configuration saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/config.json
[INFO|modeling_utils.py:831] 2021-04-25 00:08:09,054 >> Model weights saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/pytorch_model.bin
[INFO|tokenization_utils_base.py:1901] 2021-04-25 00:08:09,055 >> tokenizer config file saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/tokenizer_config.json
[INFO|tokenization_utils_base.py:1907] 2021-04-25 00:08:09,055 >> Special tokens file saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/special_tokens_map.json
[2021-04-25 00:08:09,211] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/mp_rank_00_model_states.pt
[2021-04-25 00:08:13,004] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,004] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_0_mp_rank_00_optim_states.pt
[INFO|trainer.py:1715] 2021-04-25 00:08:13,012 >> Deleting older checkpoint [/mnt/experiments/roberta-large-mlm/checkpoint-23000] due to args.save_total_limit
[2021-04-25 00:08:13,015] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,016] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_5_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,035] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,036] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_4_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,148] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,148] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_1_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,192] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,193] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_7_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,193] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,194] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_2_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,219] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,220] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_6_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,330] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,331] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_3_mp_rank_00_optim_states.pt
Traceback (most recent call last):
  File "run_mlm.py", line 535, in <module>
    main()
  File "run_mlm.py", line 482, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1172, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1269, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1346, in _save_checkpoint
    self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1716, in _rotate_checkpoints
    shutil.rmtree(checkpoint)
  File "/usr/lib/python3.6/shutil.py", line 490, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.6/shutil.py", line 488, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/mnt/experiments/roberta-large-mlm/checkpoint-23000'
```
Expected behavior
Instead of the swapping logic in the lines referenced above, `Trainer._sorted_checkpoints()` might instead do

```python
checkpoints_sorted.append(checkpoints_sorted[best_model_index])
checkpoints_sorted.remove(checkpoints_sorted[best_model_index])
```

i.e., just move the best model to the end of the list. I believe this guarantees that the checkpoints (excluding the best model) are deleted oldest first.
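On the same toy list as above, the suggested move-to-end ordering behaves as hoped (a sketch of the idea, not the actual patch):

```python
checkpoints_sorted = [
    "out/checkpoint-21000",  # best model, also the oldest
    "out/checkpoint-22000",
    "out/checkpoint-23000",  # most recent, still being written by deepspeed
]
best_model_index = 0

# Move the best checkpoint to the end instead of swapping it with the newest.
checkpoints_sorted.append(checkpoints_sorted[best_model_index])
checkpoints_sorted.remove(checkpoints_sorted[best_model_index])

print(checkpoints_sorted)
# ['out/checkpoint-22000', 'out/checkpoint-23000', 'out/checkpoint-21000']
# Rotation now deletes the oldest non-best checkpoints first and never
# touches the checkpoint that deepspeed is still writing.
```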
Top GitHub Comments
I reran my failure condition and it no longer fails, so I think this can be closed. Thanks!
Sorry – this fell off my todo list but thank you for the fix.
The original race condition I reported may not be easy to reproduce but I’ll give it a go and report back.