Issue with loading a pretrained model using DeepSpeed ZeRO Stage 3
System Info
- `transformers` version: 4.19.0.dev0
- Platform: Linux-5.4.0-90-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.5.1
- PyTorch version (GPU?): 1.12.0.dev20220505+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes (deepspeed zero stage-3)
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce the behaviour:
- Official run_glue.py script
- The ZeRO Stage-3 config zero3_config.json below (a sketch of the ZeRO-3 loading path it drives is given after this list):
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto",
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
- Bash script to run the finetuning of bert-base-uncased on the MRPC dataset using ZeRO Stage-3:
#!/bin/bash
time torchrun --nproc_per_node=2 run_glue.py \
--task_name "mrpc" \
--max_seq_len 128 \
--model_name_or_path "bert-base-uncased" \
--output_dir "./glue/mrpc_deepspeed_stage3_trainer" \
--overwrite_output_dir \
--do_train \
--evaluation_strategy "epoch" \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--max_grad_norm 1.0 \
--num_train_epochs 3 \
--lr_scheduler_type "linear" \
--warmup_steps 50 \
--logging_steps 100 \
--fp16 \
--fp16_full_eval \
--optim "adamw_torch" \
--report_to "wandb" \
--deepspeed "zero3_config.json"
- Relevant output snippets. The first shows the unexpected behaviour wherein the model is not properly initialized with the pretrained weights. The second shows the eval metrics at random-chance performance.
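For context on the loading path involved: with the Trainer, passing the config above via --deepspeed is all that is needed, and under ZeRO Stage-3 from_pretrained builds the model inside deepspeed.zero.Init so the weights are sharded across ranks while they are being loaded. Below is a minimal sketch of the non-Trainer equivalent of that loading path, for orientation only; it assumes transformers ~4.19 (where HfDeepSpeedConfig lives in transformers.deepspeed), uses a small concrete config instead of the "auto" placeholders (only the Trainer resolves those), and is expected to run under a distributed launcher such as torchrun.

import deepspeed
from transformers import AutoModelForSequenceClassification
from transformers.deepspeed import HfDeepSpeedConfig

# Concrete stand-in for zero3_config.json: outside the Trainer the "auto"
# placeholders are not resolved, so real values are needed here.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 1,
}

# Must be created (and kept referenced) BEFORE from_pretrained so that the
# weights are materialized through deepspeed.zero.Init and sharded across
# ranks, instead of every rank loading a full dense copy.
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Inference-style engine init; a training setup would also configure an
# optimizer (either here or in the DeepSpeed config).
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]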
Expected behavior
The model should be properly initialized with the pretrained weights when using DeepSpeed ZeRO Stage-3; that would resolve the poor model performance being observed.
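A quick way to make the reported behaviour visible without a full training run is to compare one ZeRO-3 sharded parameter against a plain CPU load of the same checkpoint. This is only a sketch: check_pretrained_weights is a hypothetical helper, and model is assumed to be the underlying HF model (e.g. trainer.model) after DeepSpeed has been initialized.

import deepspeed
import torch
import torch.distributed as dist
from transformers import AutoModelForSequenceClassification

def check_pretrained_weights(model, name="bert.embeddings.word_embeddings.weight"):
    # Reference copy loaded without DeepSpeed: full fp32 weights on CPU.
    ref = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    ).state_dict()[name].float()

    param = dict(model.named_parameters())[name]
    # Under ZeRO-3 each rank only holds a shard of the parameter;
    # GatheredParameters temporarily reassembles the full tensor so it
    # can be compared against the reference.
    with deepspeed.zero.GatheredParameters([param]):
        same = torch.allclose(param.data.float().cpu(), ref, atol=1e-3)
    if dist.get_rank() == 0:
        print(f"{name} matches pretrained checkpoint: {same}")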
Top GitHub Comments
Hello @stas00, yes, the above PR solves this issue. Thank you 😄. Below are the plots from finetuning microsoft/deberta-v2-xlarge-mnli (the pretrained model has 3 labels) on the MRPC dataset (this task has 2 labels).

Thank you, @pacman100.

Please try this PR: https://github.com/huggingface/transformers/pull/17373
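The actual fix for this path under ZeRO Stage-3 is in the PR linked above. Purely as background, and not taken from the thread, this is the usual way to load a checkpoint whose classification head has a different number of labels than the target task:

from transformers import AutoModelForSequenceClassification

# The MNLI checkpoint ships a 3-label classification head; MRPC has 2 labels.
# ignore_mismatched_sizes drops the incompatible head weights and
# re-initializes a fresh 2-label classifier, while all other weights are
# still loaded from the checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v2-xlarge-mnli",
    num_labels=2,
    ignore_mismatched_sizes=True,
)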