Models trained using Deepspeed ZeRO stage 3 have corrupted model weight shape
System Info
- transformers version: 4.21.1 | 4.24.0
- Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.10.0
- PyTorch version (GPU?): 1.12.1+cu113 (True)
- Tensorflow version (GPU?): 2.10.0 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
I am currently trying to use DeepSpeed to finetune an AutoModelForCausalLM model (facebook/opt-1.3b) on a multi-GPU instance with ZeRO optimization, using the unmodified run_clm_no_trainer.py script from the examples. When I train with ZeRO stage 2, the model weights can be loaded normally afterwards. However, when I use ZeRO stage 3 with CPU offload for the optimizer states, training proceeds normally with sensible loss values and metrics, but I get the following error when I try to load the saved weights.
RuntimeError: Error(s) in loading state_dict for OPTForCausalLM:
size mismatch for model.decoder.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50272, 2560]).
size mismatch for model.decoder.embed_positions.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2050, 2560]).
size mismatch for model.decoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for model.decoder.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for model.decoder.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
size mismatch for model.decoder.layers.0.self_attn.out_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
...
size mismatch for model.decoder.layers.31.fc1.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
size mismatch for model.decoder.layers.31.fc2.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50272, 2560]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
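As a side note, the failure itself is easy to reproduce in isolation with plain PyTorch: a checkpoint whose entries are zero-sized tensors (which is what a ZeRO-3 partitioned save can produce when the shards are never gathered) triggers exactly this error. A minimal sketch, with a tiny nn.Linear standing in for the real OPT model:

```python
import torch
import torch.nn as nn

# Tiny stand-in for the real OPTForCausalLM; the failure mode is the same
# for any nn.Module.
model = nn.Linear(4, 4, bias=False)

# Under ZeRO stage 3 each rank owns only a shard of every parameter, so a
# state_dict saved without first gathering the shards contains empty tensors.
partitioned_state_dict = {"weight": torch.empty(0)}

try:
    model.load_state_dict(partitioned_state_dict)
    err_msg = ""
except RuntimeError as e:
    err_msg = str(e)

# err_msg contains: "size mismatch for weight: copying a param with shape
# torch.Size([0]) from checkpoint, ..."
print(err_msg)
```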
This is very strange, as the torch.Size([0]) error appears across all layers of the model, suggesting the saved weights are simply empty and uninitialized. This is just speculation, since the documentation does not address the specifics of training with different ZeRO stages. I have also tried loading the model manually with AutoModelForCausalLM.from_pretrained('./model_dir'), where model_dir is where the weights were saved after training, yet the same error is thrown. I am not sure if this is a bug or if ZeRO stage 3 is currently unsupported. Any help would be much appreciated.
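For reference, one relevant knob on the DeepSpeed side is `stage3_gather_16bit_weights_on_model_save`, which tells ZeRO-3 to consolidate the parameter shards before a model save. A minimal config sketch for the setup described above (stage 3 with optimizer CPU offload); the values here are illustrative, not taken from the original report:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_micro_batch_size_per_gpu": "auto"
}
```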
Expected behavior
Models trained using ZeRO stage 3 should load correctly.
Issue Analytics
- State:
- Created a year ago
- Comments: 7 (3 by maintainers)
Top GitHub Comments
Thanks @pacman100. Just finished training the model and can confirm that loading works correctly with the script you linked. However, I still had to modify the script to include the fix from an earlier issue of mine to ensure the weights load correctly, link to issue.
Hello @JohnnyRacer, please refer to the code snippet below for the changes required when saving a DeepSpeed ZeRO-3 model. The example can be found here: deepspeed_with_config_support.py
https://github.com/huggingface/accelerate/blob/cea6aaa1161d45f7f23ef33fcc3b0a5999ebb5a1/examples/by_feature/deepspeed_with_config_support.py#L712-L723