
Models trained using Deepspeed ZeRO stage 3 have corrupted model weight shape

See original GitHub issue

System Info

  • transformers version: 4.21.1 | 4.24.0

  • Platform: Linux-5.15.0-52-generic-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • Huggingface_hub version: 0.10.0
  • PyTorch version (GPU?): 1.12.1+cu113 (True)
  • Tensorflow version (GPU?): 2.10.0 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I am currently trying to use DeepSpeed to finetune an AutoModelForCausalLM model (facebook/opt1.3b) on a multi-GPU instance with ZeRO optimization, using the unmodified run_clm_no_trainer.py script from the examples. When I train with ZeRO stage 2, the saved model weights load normally. However, when I use ZeRO stage 3 with CPU offload for the optimizer state, training proceeds normally with sensible loss values and metrics, but I get the following error when I try to load the saved weights.

RuntimeError: Error(s) in loading state_dict for OPTForCausalLM:
        size mismatch for model.decoder.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50272, 2560]).
        size mismatch for model.decoder.embed_positions.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2050, 2560]).
        size mismatch for model.decoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
        size mismatch for model.decoder.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
        size mismatch for model.decoder.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
        size mismatch for model.decoder.layers.0.self_attn.out_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 2560]).
...
        size mismatch for model.decoder.layers.31.fc1.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([10240, 2560]).
        size mismatch for model.decoder.layers.31.fc2.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).
        size mismatch for lm_head.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50272, 2560]).
        You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

This is very strange, as the torch.Size([0]) mismatch appears for every layer of the model, suggesting the saved weights are simply empty and uninitialized. This is just speculation, since the documentation does not address the specifics of saving weights when training with different ZeRO stages. I have also tried loading the model manually with AutoModelForCausalLM.from_pretrained('./model_dir'), where model_dir is the directory the weights were saved to after training, but the same error is thrown. I am not sure whether this is a bug or whether ZeRO stage 3 is currently unsupported. Any help would be much appreciated.
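The torch.Size([0]) shapes are consistent with ZeRO-3's parameter partitioning: each rank holds only a shard of every parameter, and the rank-local tensor is a size-0 placeholder. If a naive state_dict ends up in the checkpoint, loading reproduces exactly this traceback. A minimal sketch of the symptom, using plain PyTorch (no DeepSpeed required; the nn.Linear layer and dimensions here are illustrative, not taken from the issue):

```python
import torch
import torch.nn as nn

# A stand-in for one layer of the real model.
model = nn.Linear(2560, 2560, bias=False)

# What a naive save captures under ZeRO-3: the rank-local size-0 placeholder
# instead of the gathered full-shape parameter.
bad_checkpoint = {"weight": torch.empty(0)}

try:
    model.load_state_dict(bad_checkpoint)
except RuntimeError as err:
    message = str(err)

# The message names torch.Size([0]), matching the traceback above.
```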

Expected behavior

Models trained using ZeRO stage 3 should load correctly.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
JohnnyRacer commented, Nov 9, 2022

Thanks @pacman100. I just finished training the model and can confirm that loading works correctly with the script you linked. However, I still had to modify the script to include the fix from an earlier issue of mine to ensure the weights load correctly: link to issue.

1 reaction
pacman100 commented, Nov 8, 2022

Hello @JohnnyRacer, please refer to the code snippet below for the changes required when saving a DeepSpeed ZeRO-3 model. The example can be found here: deepspeed_with_config_support.py

https://github.com/huggingface/accelerate/blob/cea6aaa1161d45f7f23ef33fcc3b0a5999ebb5a1/examples/by_feature/deepspeed_with_config_support.py#L712-L723
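Related to the linked snippet: DeepSpeed also exposes a config flag that tells the engine to gather the full 16-bit weights on every model save, which avoids writing the size-0 placeholders in the first place. A sketch of the relevant fragment of a ZeRO-3 ds_config (depending on your DeepSpeed version the key may instead be named stage3_gather_fp16_weights_on_model_save; check your version's docs):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

Note that gathering adds a communication step at save time, so it is a trade-off between save speed and getting a directly loadable checkpoint.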
