Checkpoint breaks with DeepSpeed

See original GitHub issue

Environment info

  • transformers version: 4.3.3
  • Platform: linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.8
  • Tensorflow version (GPU?): -
  • Using GPU in script?: -
  • Using distributed or parallel set-up in script?: -

Who can help

deepspeed: @stas00

Information

Dear @stas00, with your permission I opened this bug. You are really my only hope with this issue and I truly appreciate your help. Thank you very much.

I am using the mT5 model, modified by adding adapter layers (a rough sketch of this kind of modification follows the lists below). The problem arises when:

  • loading checkpoints from a model trained with DeepSpeed

The task I am working on is:

  • paraphrase detection with the PAWS-X dataset on mT5
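
For context, the adapter code itself is not included in the issue. Below is a minimal sketch of the kind of modification described: a generic bottleneck adapter added alongside the pretrained mT5 weights, with the pretrained weights frozen so that only the adapters receive gradients. All class names, sizes, and the placement are illustrative assumptions, not the author's actual code.

# Illustrative sketch only; none of these names come from the author's repository.
import torch.nn as nn
from transformers import MT5ForConditionalGeneration

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Freeze the pretrained weights so that only the adapters are trained.
for param in model.parameters():
    param.requires_grad = False

# One adapter per encoder block; wiring them into the forward pass is omitted here.
adapters = nn.ModuleList(Adapter(model.config.d_model) for _ in model.encoder.block)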

To reproduce

Steps to reproduce the behavior:

git clone git@github.com:dorost1234/codes.git
cd codes   # enter the cloned repository before installing it
conda create --name deepspeed python=3.7
conda activate deepspeed
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
python setup.py develop
pip install deepspeed

Running the code:

deepspeed run_seq2seq.py  configs/test.json

I save a checkpoint every 10 steps. After the first checkpoint I kill the process; here is the output:

Configuration saved in outputs/checkpoint-10/config.json
Model weights saved in outputs/checkpoint-10/pytorch_model.bin
[2021-03-20 15:18:45,897] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: outputs/checkpoint-10/global_step10/mp_rank_00_model_states.pt
[2021-03-20 15:18:51,783] [INFO] [engine.py:1680:_save_zero_checkpoint] zero checkpoint saved outputs/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00optim_states.pt
Configuration saved in outputs/config.json
Model weights saved in outputs/pytorch_model.bin
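
For reference, configs/test.json is not reproduced in the issue. The sketch below shows the kind of settings involved, assuming a 10-step save interval and ZeRO stage 2 (the stage implied by the stage2.py frames in the traceback further down); every value here is an assumption, not the author's actual configuration.

# Hypothetical stand-in for configs/test.json plus a DeepSpeed config; all values assumed.
import json
import os

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},   # matches the stage2.py code path in the traceback
}

training_args = {
    "output_dir": "outputs",
    "do_train": True,
    "save_steps": 10,                    # checkpoint every 10 optimizer steps
    "per_device_train_batch_size": 8,
    "deepspeed": "ds_config.json",       # hands the DeepSpeed config to the HF Trainer
}

os.makedirs("configs", exist_ok=True)
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
with open("configs/test.json", "w") as f:
    json.dump(training_args, f, indent=2)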

Then, I continue training by running the command again:

deepspeed run_seq2seq.py  configs/test.json

Once it tries to load the checkpoint, DeepSpeed fails to restore it:

successfully loaded 1 ZeRO state_dicts for rank 0
Traceback (most recent call last):
  File "run_seq2seq.py", line 512, in <module>
    main()
  File "run_seq2seq.py", line 476, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/users/dara/dev/debug_codes/seq2seq/third_party/trainers/trainer.py", line 780, in train
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
  File "/users/dara/dev/debug_codes/seq2seq/third_party/trainers/trainer.py", line 1169, in _load_optimizer_and_scheduler
    self.deepspeed.load_checkpoint(checkpoint, load_optimizer_states=True, load_lr_scheduler_states=True)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1416, in load_checkpoint
    load_optimizer_states=load_optimizer_states)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1488, in _load_zero_checkpoint
    load_from_fp32_weights=self.zero_load_from_fp32_weights())
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1844, in load_state_dict
    self._restore_base_optimizer_state(state_dict_list)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1805, in _restore_base_optimizer_state
    self.optimizer.state[p][key].data.copy_(saved.data)
RuntimeError: The size of tensor a (302612288) must match the size of tensor b (129296512) at non-singleton dimension 0
Killing subprocess 23829
Traceback (most recent call last):
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/users/dara/anaconda3/envs/deepspeed/bin/python', '-u', 'run_seq2seq.py', '--local_rank=0', 'configs/test.json']' returned non-zero exit status 1.

Expected behavior

being able to continue training from the saved checkpoints
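
For completeness, a minimal sketch of the expected resume path, assuming a Trainer built the same way as in run_seq2seq.py (its construction is omitted here):

# "trainer" is assumed to be the Seq2SeqTrainer constructed in run_seq2seq.py.
import os
from transformers.trainer_utils import get_last_checkpoint

last_checkpoint = get_last_checkpoint("outputs") if os.path.isdir("outputs") else None
# e.g. "outputs/checkpoint-10" after the first (killed) run

# trainer.train(resume_from_checkpoint=last_checkpoint)
# With DeepSpeed enabled, this call should restore the model, optimizer, and LR-scheduler
# state from outputs/checkpoint-10/global_step10/ instead of raising the size-mismatch
# error shown in the traceback above.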

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
dorost1234 commented, Apr 1, 2021

Dear @stas00, thank you very much for the help, much appreciated. I upgraded my code to the latest version of the Hugging Face repository and I am still having the same issue. I will update the repository ASAP and keep you posted on this. Thank you very much.

1 reaction
dorost1234 commented, Apr 19, 2021

Hi @stas00, I finally found this bug; it is the issue also reported here: https://github.com/huggingface/transformers/issues/11294. I was freezing some parameters, and since the Hugging Face code currently has a bug and does not handle frozen parameters properly during checkpointing, those parameters were no longer frozen when loading from the checkpoint, which caused the mismatch in the number of parameters seen by DeepSpeed. Thanks a lot for all the hints and help on this.
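
To make the failure mode concrete, here is a self-contained sketch in plain PyTorch (no DeepSpeed, and a tiny stand-in model rather than mT5) of what the comment above describes: the optimizer state saved during the first run only covers the trainable parameters, so if the freezing step is not re-applied before resuming, the restored state is sized for a different number of elements.

import torch.nn as nn

def build_model(freeze_backbone: bool) -> nn.Module:
    # A tiny stand-in: one large "backbone" layer plus a small "adapter-like" head.
    model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 16))
    if freeze_backbone:
        for p in model[0].parameters():
            p.requires_grad = False
    return model

def trainable_numel(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# First run: the backbone is frozen, so the optimizer (and ZeRO's flattened partitions)
# only cover the small trainable head.
print("trainable at save time:  ", trainable_numel(build_model(freeze_backbone=True)))

# Resume without re-applying the freeze: every parameter is now trainable, so the saved
# optimizer state is sized for far fewer elements -> the copy_() size mismatch above.
print("trainable at resume time:", trainable_numel(build_model(freeze_backbone=False)))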

