
[Deepspeed][initialization] pegasus: unable to load/init the weights


Environment info

transformers version: 4.9.0.dev0
Platform: Ubuntu
Python version: 3.8
PyTorch version (GPU?): Y
Using GPU in script?: Y
Using distributed or parallel set-up in script?: Y
Deepspeed version: 0.4.1 (installed with pip)

@stas00,

Information

I’m trying to fine-tune the pegasus-large model using deepspeed on multiple GPUs. It seems that deepspeed is unable to initialize the weights at startup. When I remove deepspeed, the weights seem to be initialized properly. I’m wondering whether this is a bug in the deepspeed library. Details are given below.

The command:

deepspeed --num_gpus=8 examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path google/pegasus-large \
    --do_train \
    --do_eval \
    --do_predict \
    --output_dir /home/code-base/user_space/saved_models/pegasus/reddit-xsum-1024-tuned/ \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=4  \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --adam_beta2 0.98 \
    --num_train_epochs 10 \
    --overwrite_output_dir \
    --predict_with_generate \
    --evaluation_strategy steps  --eval_steps 1000 --save_steps 1000 --warmup_steps 10000 \
    --text_column document \
    --summary_column summary \
    --train_file $DS_BASE_DIR_P/train.json \
    --validation_file $DS_BASE_DIR_P/validation.json \
    --test_file $DS_BASE_DIR_P/test.json \
    --deepspeed ds_config.json

Error message:

...
Traceback (most recent call last):
  File "examples/pytorch/summarization/run_summarization.py", line 617, in <module>
    main()
  File "examples/pytorch/summarization/run_summarization.py", line 355, in main
    model = AutoModelForSeq2SeqLM.from_pretrained(
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/auto/auto_factory.py", line 395, in from_pretrained
    return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/modeling_utils.py", line 1176, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
    f(module, *args, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 1209, in __init__
    self.model = PegasusModel(config)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
    f(module, *args, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 1082, in __init__
    self.encoder = PegasusEncoder(config, self.shared)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
    f(module, *args, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 652, in __init__
    self.embed_positions = PegasusSinusoidalPositionalEmbedding(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
    f(module, *args, **kwargs)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 114, in __init__
    self.weight = self._init_weight(self.weight)
  File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 122, in _init_weight
    n_pos, dim = out.shape
ValueError: not enough values to unpack (expected 2, got 1)
Killing subprocess 3351
Killing subprocess 3352
Killing subprocess 3353
Killing subprocess 3354
Killing subprocess 3355
Killing subprocess 3356
Killing subprocess 3357
Killing subprocess 3358
...

ds_config.json is the ZeRO-3 config copied from the repository.
I checked out inside _init_weight: with deepspeed its shape is [1], i.e. a 1-d tensor holding the single value 1. In the single-GPU environment, however, its shape is [1024, 1024] and it contains floating-point numbers (i.e., much like embeddings).
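The failure can be reproduced without a GPU. The following is a minimal sketch, not the transformers code: init_sinusoidal is a hypothetical pure-Python stand-in for _init_weight that takes a shape tuple instead of a tensor, and the (1,) shape stands in for the 1-element placeholder reported above. It only illustrates why unpacking a 1-d shape into two names raises the ValueError seen in the traceback.

```python
import math


def init_sinusoidal(shape):
    """Hypothetical stand-in for _init_weight: build a sinusoidal table.

    `shape` is expected to be a 2-tuple (n_pos, dim), like the shape of a
    fully materialized embedding weight.
    """
    # Same unpacking as `n_pos, dim = out.shape` in the traceback.
    n_pos, dim = shape
    table = []
    for pos in range(n_pos):
        angles = [pos / 10000 ** (2 * (j // 2) / dim) for j in range(dim)]
        # Sines of the even columns first, cosines of the odd columns after.
        row = [math.sin(a) for a in angles[0::2]] + [math.cos(a) for a in angles[1::2]]
        table.append(row)
    return table


# A full 2-d shape works (kept small here; the report mentions [1024, 1024]).
full = init_sinusoidal((8, 4))

# A 1-d placeholder shape fails at the unpacking step.
try:
    init_sinusoidal((1,))
    message = None
except ValueError as err:
    message = str(err)
```

If the weight is still a 1-element placeholder when the initializer runs, the unpacking step fails before any values are computed, which matches the point at which the traceback stops.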

The problem arises when using:

[x] the official example scripts: (give details below)
[ ] my own modified scripts: (give details below)

The task I am working on is:

[ ] an official GLUE/SQUaD task: (give the name)
[x] my own task or dataset: (give details below) --reddit_tifu_long

To reproduce

Steps to reproduce the behavior:

Run the above command with deepspeed.

Issue Analytics

State:
Created 2 years ago
Comments: 8 (6 by maintainers)

Top GitHub Comments

stas00 commented, Jun 29, 2021

Thank you for validating that it works for you.

I’m trying to have this solved on the deepspeed side, so that all our models will work without needing to change each one of them separately. I will keep you posted on the progress.

stas00 commented, Jun 28, 2021

so the quick fix is:

--- a/src/transformers/models/pegasus/modeling_pegasus.py
+++ b/src/transformers/models/pegasus/modeling_pegasus.py
@@ -26,6 +26,7 @@ from torch import nn
 from torch.nn import CrossEntropyLoss

 from ...activations import ACT2FN
+from ...deepspeed import is_deepspeed_zero3_enabled
 from ...file_utils import (
     add_end_docstrings,
     add_start_docstrings,
@@ -109,7 +110,13 @@ class PegasusSinusoidalPositionalEmbedding(nn.Embedding):

     def __init__(self, num_positions: int, embedding_dim: int, padding_idx: Optional[int] = None):
         super().__init__(num_positions, embedding_dim)
-        self.weight = self._init_weight(self.weight)
+        if is_deepspeed_zero3_enabled():
+            import deepspeed
+            with deepspeed.zero.GatheredParameters(self.weight, modifier_rank=0):
+                self.weight = self._init_weight(self.weight)
+        else:
+            self.weight = self._init_weight(self.weight)
+

     @staticmethod
     def _init_weight(out: nn.Parameter):

Let me know if you can handle the diff.

I will work on a normal PR and test. Ideally we should think of something that requires fewer code changes, but this will do the right thing for now.
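One possible way to reduce the duplicated branch in the diff, offered only as a sketch and not as the fix that was eventually merged: a small helper that returns either the deepspeed gathering context or a no-op context. The helper name maybe_gathered and its zero3_enabled parameter are assumptions introduced here for illustration; deepspeed.zero.GatheredParameters with modifier_rank=0 is used exactly as in the diff.

```python
import contextlib


def maybe_gathered(params, zero3_enabled):
    """Hypothetical helper: return a context in which `params` can be modified.

    With ZeRO-3 enabled, the partitioned parameters are gathered first, as in
    the diff above. Otherwise a no-op context is returned.
    """
    if zero3_enabled:
        import deepspeed  # imported lazily, only needed on the ZeRO-3 path

        return deepspeed.zero.GatheredParameters(params, modifier_rank=0)
    return contextlib.nullcontext()


# Inside __init__, the two branches of the diff would then collapse to:
#
#     with maybe_gathered(self.weight, is_deepspeed_zero3_enabled()):
#         self.weight = self._init_weight(self.weight)

# Without ZeRO-3 the helper is a plain no-op context manager.
with maybe_gathered([], zero3_enabled=False) as ctx:
    entered = True
```

The lazy import keeps the non-deepspeed path free of a deepspeed dependency, so the same initializer runs unchanged in the single-GPU case.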
