[Deepspeed][initialization] pegasus: unable to load/init the weights
Environment info
- transformers version: 4.9.0.dev0
- Platform: Ubuntu
- Python version: 3.8
- PyTorch version (GPU?): Y
- Using GPU in script?: Y
- Using distributed or parallel set-up in script?: Y
- Deepspeed version: deepspeed 0.4.1 (installed with pip)
Information
I’m trying to fine-tune the pegasus-large model using deepspeed on multiple GPUs. It seems that deepspeed is unable to initialize the weights at startup. When I remove deepspeed, the weights seem to be initialized properly. I’m not sure whether this is a bug in the deepspeed library. Details are given below.
The command:
deepspeed --num_gpus=8 examples/pytorch/summarization/run_summarization.py \
--model_name_or_path google/pegasus-large \
--do_train \
--do_eval \
--do_predict \
--output_dir /home/code-base/user_space/saved_models/pegasus/reddit-xsum-1024-tuned/ \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=4 \
--learning_rate 3e-5 \
--weight_decay 0.01 \
--adam_beta2 0.98 \
--num_train_epochs 10 \
--overwrite_output_dir \
--predict_with_generate \
--evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --warmup_steps 10000 \
--text_column document \
--summary_column summary \
--train_file $DS_BASE_DIR_P/train.json \
--validation_file $DS_BASE_DIR_P/validation.json \
--test_file $DS_BASE_DIR_P/test.json \
--deepspeed ds_config.json
Error message:
...
Traceback (most recent call last):
File "examples/pytorch/summarization/run_summarization.py", line 617, in <module>
main()
File "examples/pytorch/summarization/run_summarization.py", line 355, in main
model = AutoModelForSeq2SeqLM.from_pretrained(
File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/auto/auto_factory.py", line 395, in from_pretrained
return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/modeling_utils.py", line 1176, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
f(module, *args, **kwargs)
File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 1209, in __init__
self.model = PegasusModel(config)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
f(module, *args, **kwargs)
File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 1082, in __init__
self.encoder = PegasusEncoder(config, self.shared)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
f(module, *args, **kwargs)
File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 652, in __init__
self.embed_positions = PegasusSinusoidalPositionalEmbedding(
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 226, in wrapper
f(module, *args, **kwargs)
File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 114, in __init__
self.weight = self._init_weight(self.weight)
File "/trainman-mount/trainman-k8s-storage-5ddccee4-32ad-4e32-ba2d-1d06b71f80b0/packages/transformers/src/transformers/models/pegasus/modeling_pegasus.py", line 122, in _init_weight
n_pos, dim = out.shape
ValueError: not enough values to unpack (expected 2, got 1)
Killing subprocess 3351
Killing subprocess 3352
Killing subprocess 3353
Killing subprocess 3354
Killing subprocess 3355
Killing subprocess 3356
Killing subprocess 3357
Killing subprocess 3358
...
- ds_config.json is the ZeRO-3 config copied from the repository.
- I checked self.out: with deepspeed, its shape is [1] and it holds only a single value, 1. In a single-GPU environment, however, its shape is [1024, 1024] and it contains floating-point numbers (i.e., it looks much like embeddings).
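The failure mode described above can be reproduced without deepspeed or PyTorch. The sketch below is illustrative only: `init_sinusoidal` is a hypothetical stand-in that copies just the unpacking pattern from the traceback (`n_pos, dim = out.shape`) and builds a simplified sinusoidal table; it is not the transformers implementation. It assumes, as the reported shapes suggest, that under ZeRO-3 the weight seen during `__init__` is a 1-element placeholder rather than the full 2-d tensor.

```python
import math


def init_sinusoidal(shape):
    """Hypothetical stand-in for `_init_weight`: expects a 2-d shape."""
    # Same unpacking pattern as `n_pos, dim = out.shape` in the traceback.
    n_pos, dim = shape
    # Simplified sinusoidal table (illustrative; not the transformers code).
    return [
        [
            math.sin(pos / 10000 ** (2 * (j // 2) / dim)) if j % 2 == 0
            else math.cos(pos / 10000 ** (2 * (j // 2) / dim))
            for j in range(dim)
        ]
        for pos in range(n_pos)
    ]


# A full 2-d shape, as seen without deepspeed (kept small here for brevity).
table = init_sinusoidal((4, 8))
print(len(table), len(table[0]))

# The 1-d placeholder shape reported with deepspeed.
try:
    init_sinusoidal((1,))
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```

The second call fails at the unpacking step, before any values are computed, which matches the last frame of the traceback. This suggests the problem lies in what the module's `__init__` is handed, not in the sinusoidal formula itself.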
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
The task I am working on is:
- [ ] an official GLUE/SQUaD task: (give the name)
- [x] my own task or dataset: (give details below) --reddit_tifu_long
To reproduce
Steps to reproduce the behavior:
- Run the above command with deepspeed.
Top GitHub Comments
Thank you for validating that it works for you.
I’m trying to have this solved on the deepspeed side, so that all our models will work without needing to change each one of them separately. I will keep you posted on the progress.
so the quick fix is:
Let me know if you can handle the diff.
I will work on a proper PR and test it. Ideally we should find something that requires fewer code changes, but this will do the right thing for now.