
Race condition when using --save_total_limit, --load_best_model_at_end and deepspeed zero2+cpu_offload

See original GitHub issue

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-5.4.0-1045-aws-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.7.1+cu110 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes, AWS p4d.24xlarge
  • Using distributed or parallel set-up in script?: yes, deepspeed

Who can help

Library:

Information

Model I am using (Bert, XLNet …): roberta-large

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

I’m fine-tuning using run_mlm.py.

A race condition seems to exist when:

  1. you limit the number of checkpoints with --save_total_limit
  2. you enable --load_best_model_at_end --metric_for_best_model eval_loss
  3. you use multigpu training with deepspeed zero2 + cpu_offload
  4. the best model happens to be at the head of the list returned by Trainer._sorted_checkpoints()

This corner case arises because, due to the swapping logic in Trainer._sorted_checkpoints(), the checkpoint selected for deletion is the most recent one: https://github.com/huggingface/transformers/blob/bf2e0cf70b68e0d46cdf15a4ece1f5c0a03de084/src/transformers/trainer.py#L1818-L1821
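
For reference, the swap at the linked lines looks roughly like this (paraphrased from the linked revision; checkpoints_sorted is ordered oldest-first and best_model_index points at the best checkpoint):

# Swap the best checkpoint with the last (most recent) entry so it survives rotation.
checkpoints_sorted[best_model_index], checkpoints_sorted[-1] = (
    checkpoints_sorted[-1],
    checkpoints_sorted[best_model_index],
)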

When (by chance) best_model_index == 0, the swap puts the most recent checkpoint at the head of the list. When Trainer._rotate_checkpoints() is then called, it deletes from the head and therefore deletes the most recent checkpoint. (Aside: this is probably a bug in its own right, since you would never be able to resume training from the most recent checkpoint.) At that point, however, deepspeed has not yet finished writing its own global checkpoint into the current checkpoint directory, causing the following error to be thrown:

[INFO|trainer.py:1648] 2021-04-25 00:08:06,377 >> Saving model checkpoint to /mnt/experiments/roberta-large-mlm/checkpoint-23000
[INFO|configuration_utils.py:329] 2021-04-25 00:08:06,378 >> Configuration saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/config.json
[INFO|modeling_utils.py:831] 2021-04-25 00:08:09,054 >> Model weights saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/pytorch_model.bin
[INFO|tokenization_utils_base.py:1901] 2021-04-25 00:08:09,055 >> tokenizer config file saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/tokenizer_config.json
[INFO|tokenization_utils_base.py:1907] 2021-04-25 00:08:09,055 >> Special tokens file saved in /mnt/experiments/roberta-large-mlm/checkpoint-23000/special_tokens_map.json
[2021-04-25 00:08:09,211] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/mp_rank_00_model_states.pt
[2021-04-25 00:08:13,004] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,004] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_0_mp_rank_00_optim_states.pt
[INFO|trainer.py:1715] 2021-04-25 00:08:13,012 >> Deleting older checkpoint [/mnt/experiments/roberta-large-mlm/checkpoint-23000] due to args.save_total_limit
[2021-04-25 00:08:13,015] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,016] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_5_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,035] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,036] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_4_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,148] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,148] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_1_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,192] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,193] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_7_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,193] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,194] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_2_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,219] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,220] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_6_mp_rank_00_optim_states.pt
[2021-04-25 00:08:13,330] [INFO] [engine.py:1717:_copy_recovery_script] creating recovery script /mnt/experiments/roberta-large-mlm/checkpoint-23000/zero_to_fp32.py
[2021-04-25 00:08:13,331] [INFO] [engine.py:1730:_save_zero_checkpoint] zero checkpoint saved /mnt/experiments/roberta-large-mlm/checkpoint-23000/global_step23000/zero_pp_rank_3_mp_rank_00_optim_states.pt
Traceback (most recent call last):
  File "run_mlm.py", line 535, in <module>
    main()
  File "run_mlm.py", line 482, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1172, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1269, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1346, in _save_checkpoint
    self._rotate_checkpoints(use_mtime=True, output_dir=run_dir)
  File "/home/cklin/ve/lib/python3.6/site-packages/transformers/trainer.py", line 1716, in _rotate_checkpoints
    shutil.rmtree(checkpoint)
  File "/usr/lib/python3.6/shutil.py", line 490, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/usr/lib/python3.6/shutil.py", line 488, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/mnt/experiments/roberta-large-mlm/checkpoint-23000'

Expected behavior

Instead of the swapping logic in the lines referenced above, Trainer._sorted_checkpoints() could do

# Move the best checkpoint to the end; list.remove() drops the first (original)
# occurrence, so only the appended copy at the end remains.
checkpoints_sorted.append(checkpoints_sorted[best_model_index])
checkpoints_sorted.remove(checkpoints_sorted[best_model_index])

i.e., just move the best model to the end of the list.

I believe this guarantees that the remaining checkpoints (excluding the best model) are deleted oldest-first.
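
A minimal, self-contained sketch (hypothetical checkpoint names, not the actual paths from the run above) contrasting the two behaviors when best_model_index == 0:

# Checkpoints ordered oldest-first, as _sorted_checkpoints() returns them.
checkpoints = ["checkpoint-21000", "checkpoint-22000", "checkpoint-23000"]
best_model_index = 0  # the best model happens to be the oldest checkpoint

# Current behavior: swap the best checkpoint with the last entry.
swapped = list(checkpoints)
swapped[best_model_index], swapped[-1] = swapped[-1], swapped[best_model_index]
print(swapped)  # ['checkpoint-23000', 'checkpoint-22000', 'checkpoint-21000']
# _rotate_checkpoints() deletes from the head, so the most recent checkpoint
# (which deepspeed may still be writing to) is removed first.

# Proposed behavior: move the best checkpoint to the end instead.
moved = list(checkpoints)
moved.append(moved[best_model_index])
moved.remove(moved[best_model_index])
print(moved)  # ['checkpoint-22000', 'checkpoint-23000', 'checkpoint-21000']
# Deletion now starts with the oldest non-best checkpoint.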

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

chitkwan commented, Jun 19, 2021 (1 reaction)

I reran my failure condition and it no longer fails, so I think this can be closed. Thanks!

chitkwan commented, May 26, 2021 (1 reaction)

Sorry – this fell off my todo list but thank you for the fix.

The original race condition I reported may not be easy to reproduce but I’ll give it a go and report back.
