
Race condition when using --save_total_limit, --load_best_model_at_end and deepspeed zero2+cpu_offload

See original GitHub issue

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-5.4.0-1045-aws-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.1+cu110 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes, AWS p4d.24xlarge
  • Using distributed or parallel set-up in script?: yes, deepspeed

Who can help

Library:

Information

Model I am using (Bert, XLNet …): roberta-large

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I’m fine-tuning using run_mlm.py.

A race condition seems to exist when:

  1. you limit the number of checkpoints with --save_total_limit
  2. you enable --load_best_model_at_end --metric_for_best_model eval_loss
  3. you use multigpu training with deepspeed zero2 + cpu_offload
  4. the best model happens to be at the head of the list returned by Trainer._sorted_checkpoints()

This corner case arises because, due to the swapping logic in Trainer._sorted_checkpoints(), the checkpoint selected for deletion is the most recent one: https://github.com/huggingface/transformers/blob/bf2e0cf70b68e0d46cdf15a4ece1f5c0a03de084/src/transformers/trainer.py#L1818-L1821
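
For reference, the swap at the linked lines looks roughly like this (paraphrased from the linked revision; checkpoints_sorted is ordered oldest-first and best_model_index points at the best checkpoint):

# Swap the best checkpoint with the last (most recent) entry so it survives rotation.
checkpoints_sorted[best_model_index], checkpoints_sorted[-1] = (
    checkpoints_sorted[-1],
    checkpoints_sorted[best_model_index],
)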

When (by chance) best_model_index == 0, the swap puts the most recent checkpoint at the head of the list. When Trainer._rotate_checkpoints() is then called, it deletes from the head and therefore deletes the most recent checkpoint. (Aside: this is probably a bug in its own right, since you would never be able to resume training from the most recent checkpoint.) At that point, however, deepspeed has not yet finished writing its own global checkpoint into the current checkpoint directory, causing the following error to be thrown:

[INFO|trainer.py:1648] 2021-04-25 00:08:06,377 >> Saving model checkpoint to /mnt/experiments/roberta-large-mlm/checkpoint-23000
[INFO|configuration_utils.py:329] 2021-04-25 00:08:06,378 >> Configuration saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/config.json
[INFO|modeling_utils.py:831] 2021-04-25 00:08:09,054 >> Model weights saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/pytorch_model.bin
[INFO|tokenization_utils_base.py:1901] 2021-04-25 00:08:09,055 >> tokenizer config file saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/tokenizer_config.json
[INFO|tokenization_utils_base.py:1907] 2021-04-25 00:08:09,055 >> Special tokens file saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/special_tokens_map.json
[2021-04-25 00:08:09,211] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/mp_rank_00_model_states.pt
[2021-04-25 00:08:13,004] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,004] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_0_mp_rank_00_optim_states.pt
[INFO|trainer.py:1715] 2021-04-25 00:08:13,012 >> Deleting older checkpoint [/mnt/experiments/roberta-large-mlm/checkpoint-23000] due to args.save_total_limit
[2021-04-25 00:08:13,015] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,016] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_5_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,035] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,036] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_4_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,148] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,148] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_1_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,192] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,193] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_7_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,193] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,194] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_2_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,219] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,220] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_6_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,330] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,331] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_3_mp_rank_00_optim_states.pt
Traceback (most recent call last):
  File "run_mlm.py", line 535, in <module>
    main()
  File "run_mlm.py", line 482, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1172, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1269, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1346, in _save_checkpoint
    self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1716, in _rotate_checkpoints
    shutil.rmtree(checkpoint)
  File "/usr/lib/python3.6/shutil.py", line 490, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.6/shutil.py", line 488, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/mnt/experiments/roberta-large-mlm/checkpoint-23000'

Expected behavior

Instead of the swapping logic in the lines referenced above, Trainer._sorted_checkpoints() could do

# Move the best checkpoint to the end; list.remove() drops the first (original)
# occurrence, so only the appended copy at the end remains.
checkpoints_sorted.append(checkpoints_sorted[best_model_index])
checkpoints_sorted.remove(checkpoints_sorted[best_model_index])

i.e., just move the best model to the end of the list.

I believe this guarantees that the remaining checkpoints (excluding the best model) are deleted oldest-first.
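
A minimal, self-contained sketch (hypothetical checkpoint names, not the actual paths from the run above) contrasting the two behaviors when best_model_index == 0:

# Checkpoints ordered oldest-first, as _sorted_checkpoints() returns them.
checkpoints = ["checkpoint-21000", "checkpoint-22000", "checkpoint-23000"]
best_model_index = 0  # the best model happens to be the oldest checkpoint

# Current behavior: swap the best checkpoint with the last entry.
swapped = list(checkpoints)
swapped[best_model_index], swapped[-1] = swapped[-1], swapped[best_model_index]
print(swapped)  # ['checkpoint-23000', 'checkpoint-22000', 'checkpoint-21000']
# _rotate_checkpoints() deletes from the head, so the most recent checkpoint
# (which deepspeed may still be writing to) is removed first.

# Proposed behavior: move the best checkpoint to the end instead.
moved = list(checkpoints)
moved.append(moved[best_model_index])
moved.remove(moved[best_model_index])
print(moved)  # ['checkpoint-22000', 'checkpoint-23000', 'checkpoint-21000']
# Deletion now starts with the oldest non-best checkpoint.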

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

chitkwan commented, Jun 19, 2021 (1 reaction)

I reran my failure condition and it no longer fails, so I think this can be closed. Thanks!

chitkwan commented, May 26, 2021 (1 reaction)

Sorry – this fell off my todo list but thank you for the fix.

The original race condition I reported may not be easy to reproduce but I’ll give it a go and report back.
