[DeepSpeed] ZeRO stage 3 integration: getting started and issues
Why would you want ZeRO-3
In a few words: ZeRO-2 was very limited scalability-wise. If `model.half()` couldn't fit onto a single GPU, adding more GPUs wouldn't have helped, so with a 24GB GPU you couldn't train a model larger than about 5B params.
Since with ZeRO-3 the model weights are partitioned across multiple GPUs and can additionally be offloaded to CPU, the upper limit on model size has increased by about 2 orders of magnitude. That is, ZeRO-3 allows you to scale to huge models with trillions of parameters, assuming you have enough GPUs and general RAM to support this. ZeRO-3 can benefit a lot from general RAM if you have it; if not, that's OK too. ZeRO-3 combines all your GPUs' memory and general RAM into one vast pool of memory.
Even if you have just a single GPU, as long as you have a lot of general RAM, ZeRO-3 will allow you to fit larger models.
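If you are wondering how much GPU and CPU memory a given model would need under ZeRO-3 before committing to a full run, DeepSpeed ships a memory estimator you can call on an instantiated model. A minimal sketch, assuming a DeepSpeed version recent enough to include the stage-3 estimator helpers:

```python
# Rough ZeRO-3 memory estimate for t5-small on 1 GPU / 1 node.
# Assumes a DeepSpeed release that ships the stage-3 estimator helper.
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModel.from_pretrained("t5-small")

# Prints the per-GPU and CPU memory needed for the model states under
# several offload configurations.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
```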
Of course, if you run in an environment like the free Google Colab, while you can run DeepSpeed there, you get so little general RAM that it's very hard to make something out of nothing. In some sessions (or for some users) one gets only 12GB of RAM, which is impossible to work with - you want at least 24GB instances. Setting it up might be tricky too, please see this notebook for an example: https://github.com/stas00/porting/blob/master/transformers/deepspeed/DeepSpeed_on_colab_CLI.ipynb
Getting started
Install the latest deepspeed version:
```bash
pip install deepspeed
```
You will want to be on the transformers master branch. If you want to run a quick test:
```bash
git clone https://github.com/huggingface/transformers
cd transformers

BS=4; PYTHONPATH=src USE_TF=0 deepspeed examples/seq2seq/run_translation.py \
--model_name_or_path t5-small --output_dir /tmp/zero3 --overwrite_output_dir --max_train_samples 64 \
--max_val_samples 64 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 \
--do_train --num_train_epochs 1 --per_device_train_batch_size $BS --per_device_eval_batch_size $BS \
--learning_rate 3e-3 --warmup_steps 500 --predict_with_generate --logging_steps 0 --save_steps 0 \
--eval_steps 1 --group_by_length --dataset_name wmt16 --dataset_config ro-en --source_lang en \
--target_lang ro --source_prefix "translate English to Romanian: " \
--deepspeed tests/deepspeed/ds_config_zero3.json
```
You will find a very detailed configuration guide here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed
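If you prefer to wire things up in your own script rather than via the `--deepspeed` command-line flag, the same config file can be handed to the Trainer through the `deepspeed` field of `TrainingArguments`. A minimal sketch, assuming you still start the script with the `deepspeed` launcher and where `my_dataset` stands in for your own dataset:

```python
# Passing the ZeRO-3 config to the HF Trainer in code instead of on the
# command line. The script still has to be started with the `deepspeed`
# launcher so the distributed environment gets set up.
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/zero3",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    fp16=True,
    deepspeed="tests/deepspeed/ds_config_zero3.json",  # path to the config shown below
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
trainer = Trainer(model=model, args=training_args, train_dataset=my_dataset)  # my_dataset: your own dataset
trainer.train()
```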
Your new config file will look like this:
```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "cpu_offload": true,
        "cpu_offload_params": true,
        "cpu_offload_use_pin_memory": true,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 0.94e6,
        "stage3_param_persistence_threshold": 1e4,
        "reduce_bucket_size": 1e6,
        "prefetch_bucket_size": 3e6,
        "sub_group_size": 1e14,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
```
So if you were already using ZeRO-2, it's only the `zero_optimization` section that has changed.
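One thing to watch out for: the optimizer and scheduler values in this file are separate from the Trainer command-line arguments, so it's easy for the two to drift apart (the lr above is 3e-5 while the quick test passes --learning_rate 3e-3). A low-tech way to keep them in sync is to patch the JSON before launching - a hypothetical helper, not part of transformers or DeepSpeed:

```python
# Hypothetical helper: rewrite the DeepSpeed config so its optimizer and
# scheduler values match the args you intend to pass to run_translation.py.
import json

def sync_ds_config(path, lr, warmup_steps):
    with open(path) as f:
        cfg = json.load(f)
    cfg["optimizer"]["params"]["lr"] = lr
    cfg["scheduler"]["params"]["warmup_max_lr"] = lr
    cfg["scheduler"]["params"]["warmup_num_steps"] = warmup_steps
    with open(path, "w") as f:
        json.dump(cfg, f, indent=4)

# e.g. match --learning_rate 3e-3 --warmup_steps 500 from the quick test above
sync_ds_config("tests/deepspeed/ds_config_zero3.json", lr=3e-3, warmup_steps=500)
```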
One of the biggest nuances of ZeRO-3 is that the model weights aren't inside `model.state_dict`, as they are spread out across multiple GPUs. The Trainer has been modified to support this, but you will notice a slow model saving - it has to consolidate the weights from all the GPUs. I'm planning to do more performance improvements in future PRs, but for now let's focus on making things work.
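To make that nuance concrete: outside of a forward/backward pass each rank only holds its own shard of every parameter, so reading a weight directly gives you a placeholder rather than the full tensor, and it has to be gathered first. A minimal sketch using DeepSpeed's gathering context manager, assuming `model` has already been set up under ZeRO-3 (e.g. via `deepspeed.initialize` or the Trainer):

```python
# Under ZeRO-3 each parameter is partitioned across ranks; gather it before
# inspecting its full value. Assumes `model` is already running under ZeRO-3.
import deepspeed

p = next(model.parameters())
print(p.shape)  # usually a tiny placeholder (e.g. torch.Size([0])) while partitioned

with deepspeed.zero.GatheredParameters(p):
    print(p.shape)  # the full shape - only valid inside this context
```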
Issues / Questions
If you have any general questions or something is unclear/missing in the docs please don’t hesitate to ask in this thread. But for any bugs or problems please open a new Issue and tag me there. You don’t need to tag anybody else. Thank you!
Top GitHub Comments
Let’s ask Deepspeed devs: https://github.com/microsoft/DeepSpeed/issues/1194
Meanwhile if it works for you, that’s great! Thank you for doing the experiment.
@sajastu, should be fixed in https://github.com/huggingface/transformers/pull/12690