Checkpoint breaks with deepspeed
Environment info
- transformers version: 4.3.3
- Platform: Linux
- Python version: 3.7
- PyTorch version (GPU?): 1.8
- Tensorflow version (GPU?): -
- Using GPU in script?: -
- Using distributed or parallel set-up in script?: -
Who can help
deepspeed: @stas00
Information
Dear @stas00, with your permission I opened this bug report. You are really my only hope with this issue, and I truly appreciate your help. Thank you very much.
I am using the mT5 model, modified by adding adapter layers (a rough sketch of the setup follows after this list). The problem arises when:
- loading checkpoints from a model trained with deepspeed
The task I am working on is:
- paraphrase detection using the PAWS-X dataset with the mT5 model
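A rough, hypothetical sketch of the setup (the real adapter implementation is in the linked repository; the freezing detail comes up again in the resolution at the end of this issue):

```python
# Hypothetical sketch only: mT5 with added adapter layers, where only the
# adapter parameters are trained. The actual adapter code lives in the linked
# repository; this just illustrates the shape of the setup.
from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
# ... adapter modules are injected into the encoder/decoder blocks here (omitted) ...
for name, param in model.named_parameters():
    param.requires_grad = "adapter" in name  # only adapter weights stay trainable
```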
To reproduce
Steps to reproduce the behavior:
```bash
git clone git@github.com:dorost1234/codes.git
conda create --name deepspeed python=3.7
conda activate deepspeed
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge
python setup.py develop
pip install deepspeed
```
Then run the code:
```bash
deepspeed run_seq2seq.py configs/test.json
```
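I am not pasting the exact configuration here (it is in the repository), but the DeepSpeed side of it is ZeRO stage 2 (the traceback below goes through deepspeed/runtime/zero/stage2.py). An illustrative, minimal config of that kind, not the actual file from the repo and with placeholder values, would look like:

```json
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 3e-5
    }
  }
}
```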
I save a checkpoint every 10 steps. After the first checkpoint is saved I kill the run; here is the output:
```
Configuration saved in outputs/checkpoint-10/config.json
Model weights saved in outputs/checkpoint-10/pytorch_model.bin
[2021-03-20 15:18:45,897] [INFO] [logging.py:60:log_dist] [Rank 0] Saving model checkpoint: outputs/checkpoint-10/global_step10/mp_rank_00_model_states.pt
[2021-03-20 15:18:51,783] [INFO] [engine.py:1680:_save_zero_checkpoint] zero checkpoint saved outputs/checkpoint-10/global_step10/zero_pp_rank_0_mp_rank_00optim_states.pt
Configuration saved in outputs/config.json
Model weights saved in outputs/pytorch_model.bin
```
Then I continue training by running the same command again:
```bash
deepspeed run_seq2seq.py configs/test.json
```
Once it tries to load the checkpoint, DeepSpeed fails:
```
successfully loaded 1 ZeRO state_dicts for rank 0
Traceback (most recent call last):
  File "run_seq2seq.py", line 512, in <module>
    main()
  File "run_seq2seq.py", line 476, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/users/dara/dev/debug_codes/seq2seq/third_party/trainers/trainer.py", line 780, in train
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
  File "/users/dara/dev/debug_codes/seq2seq/third_party/trainers/trainer.py", line 1169, in _load_optimizer_and_scheduler
    self.deepspeed.load_checkpoint(checkpoint, load_optimizer_states=True, load_lr_scheduler_states=True)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1416, in load_checkpoint
    load_optimizer_states=load_optimizer_states)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1488, in _load_zero_checkpoint
    load_from_fp32_weights=self.zero_load_from_fp32_weights())
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1844, in load_state_dict
    self._restore_base_optimizer_state(state_dict_list)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/runtime/zero/stage2.py", line 1805, in _restore_base_optimizer_state
    self.optimizer.state[p][key].data.copy_(saved.data)
RuntimeError: The size of tensor a (302612288) must match the size of tensor b (129296512) at non-singleton dimension 0
Killing subprocess 23829
Traceback (most recent call last):
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/users/dara/libs/anaconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/users/dara/anaconda3/envs/deepspeed/bin/python', '-u', 'run_seq2seq.py', '--local_rank=0', 'configs/test.json']' returned non-zero exit status 1.
```
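The failing copy in `_restore_base_optimizer_state` means the flattened optimizer state stored in the checkpoint has a different size from the one the freshly built engine expects. A quick, hypothetical sanity check (not part of the repo; `model` stands for the model passed to the `Trainer`) is to print the trainable-parameter count in both the run that saved the checkpoint and the run that resumes from it:

```python
# Hypothetical sanity check: if the trainable-parameter count differs between
# the saving run and the resuming run, DeepSpeed's flattened fp32 partitions
# will not line up, which produces exactly this kind of size-mismatch error.
def count_trainable_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("trainable parameters:", count_trainable_params(model))  # run in both jobs
```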
Expected behavior
Being able to continue training from the saved checkpoints.
Comments
Dear @stas00, thank you very much for the help, much appreciated. I upgraded my code to the latest version of the Hugging Face repository and I am still having the same issue. I will update the repository ASAP and keep you posted. Thank you very much.
Hi @stas00, I finally found the bug. It is the issue also reported here: https://github.com/huggingface/transformers/issues/11294. I was freezing some parameters, and since the Hugging Face code does not currently handle frozen parameters properly during checkpointing, those parameters were no longer frozen when the model was loaded from the checkpoint. This changed the number of parameters DeepSpeed partitions and caused the size mismatch. Thanks a lot for all the hints and help on this.
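For anyone hitting the same error, a minimal sketch of the workaround implied by this diagnosis, assuming the adapter parameters can be identified by name (the exact predicate depends on the adapter implementation; `model`, `trainer`, and `checkpoint` are the objects from the training script):

```python
# Minimal sketch, assuming adapter parameters contain "adapter" in their names.
# Re-apply the same freezing before resuming, so the set of trainable parameters
# matches the one used when the DeepSpeed ZeRO checkpoint was written.
def freeze_all_but_adapters(model):
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name

freeze_all_but_adapters(model)
trainer.train(resume_from_checkpoint=checkpoint)
```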