[DeepSpeed] ZeRO-Infinity integration: getting started and issues
DeepSpeed ZeRO-Infinity HF Integration is now available in the master branch of transformers. Here is a quick getting-started / what's-new post.
ZeRO-Infinity extends ZeRO-3 by adding NVMe offload on top of CPU offload, enabling training of even bigger models, and it adds various other optimizations and improvements.
Getting started
Install the latest deepspeed version:

```bash
pip install git+https://github.com/microsoft/DeepSpeed
```
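Optionally, you can sanity-check the install with DeepSpeed's own environment report (the ds_report command ships with the package):

```bash
# show the DeepSpeed version, torch/cuda setup and which ops are pre-built
ds_report
python -c "import deepspeed; print(deepspeed.__version__)"
```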
If you want to run a quick test, you will want to be on the transformers master branch:

```bash
git clone https://github.com/huggingface/transformers
cd transformers
BS=4; PYTHONPATH=src USE_TF=0 deepspeed examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --output_dir /tmp/zero3 --overwrite_output_dir --max_train_samples 64 \
--max_eval_samples 64 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 \
--do_train --num_train_epochs 1 --per_device_train_batch_size $BS --per_device_eval_batch_size $BS \
--learning_rate 3e-3 --warmup_steps 500 --predict_with_generate --logging_steps 0 --save_steps 0 \
--eval_steps 1 --group_by_length --dataset_name wmt16 --dataset_config ro-en --source_lang en \
--target_lang ro --source_prefix "translate English to Romanian: " \
--deepspeed tests/deepspeed/ds_config_zero3.json
```
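The deepspeed launcher will use all visible GPUs by default; for a quick smoke test on a single GPU you can restrict it with the launcher's --num_gpus flag (same command as above, elided here):

```bash
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py ... \
--deepspeed tests/deepspeed/ds_config_zero3.json
```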
You will find very detailed documentation here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed
Your new config file will look like this (for ZeRO-3 as an example):
```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e14,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```
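Every auto entry is resolved by the HF Trainer from the matching command line argument / TrainingArguments field (learning rate, warmup, batch size, fp16, etc.), so the two can never go out of sync. If you drive the Trainer from your own script rather than the example above, here is a minimal sketch (paths and hyperparameters are placeholders) of passing the config via the deepspeed argument of TrainingArguments:

```python
from transformers import TrainingArguments

# values that map to "auto" entries in the ds_config are taken from here
training_args = TrainingArguments(
    output_dir="/tmp/zero3",                 # placeholder
    per_device_train_batch_size=4,
    learning_rate=3e-3,
    warmup_steps=500,
    fp16=True,
    deepspeed="tests/deepspeed/ds_config_zero3.json",  # the config shown above
)
```

Launch such a script with the deepspeed launcher, as in the quick-test command above.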
If you want to experiment with NVMe offload, please see: https://huggingface.co/transformers/master/main_classes/trainer.html#nvme-support
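For example, to offload to NVMe instead of CPU, the offload sections of the ZeRO-3 config above change roughly like this (a sketch; /local_nvme is a placeholder path to a fast local NVMe mount, and further tuning knobs are described in the doc linked above):

```json
"offload_optimizer": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": true
},
"offload_param": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": true
}
```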
DeepSpeed currently runs only in fp16 mixed precision
While the deepspeed devs are working on an fp32 mode, at this moment only fp16/amp-like train/eval is available. So if your model struggles under fp16/amp, it will have the same struggles under deepspeed.
Moreover, because deepspeed does model.half(), forcing all weights to fp16, some models might not be ready for this (under AMP things are switched to fp16 dynamically only where needed). If you run into this, please post a new issue and we will try to find a solution/workaround for those special cases.
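If you are unsure whether your model is fp16-friendly, a rough standalone check (a hypothetical snippet, not part of transformers; t5-small is just an example) is to run a forward pass on a .half() copy and look for inf/nan in the logits:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").half().cuda()
tokenizer = AutoTokenizer.from_pretrained("t5-small")

batch = tokenizer("translate English to Romanian: hello", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**batch, decoder_input_ids=batch["input_ids"])

# overflows here hint that the model may struggle under fp16, and therefore under deepspeed
print("inf:", torch.isinf(out.logits).any().item(), "nan:", torch.isnan(out.logits).any().item())
```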
Must use the latest transformers master

If you get deepspeed errors like it not knowing what the auto value is, you aren't on the latest transformers master branch: git pull if you already have a clone, and if you installed it, update your install.
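For example, assuming you are using the clone from the quick test above and installed it from source, updating would look roughly like this:

```bash
cd transformers
git pull
pip install -e .   # refresh the install after pulling
```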
For those who already use DeepSpeed HF integration
As the integration is evolving, it has gone through a major revamp and received various improvements.
There are 2 important changes that you need to be aware of if you're already using the DeepSpeed integration in transformers:
- After this release, only config params that are set to auto will get automatically overridden/set to the correct/recommended values; everything else is left as is. This is to avoid the previously confusing behavior of never being quite sure what got overridden and what didn't, despite the logger reporting what it did override. The new behavior is completely unambiguous. See the examples in the full doc: https://huggingface.co/transformers/master/main_classes/trainer.html#shared-configuration
- If you are using massive models and aren't using the example scripts, make sure to read the full doc: https://huggingface.co/transformers/master/main_classes/trainer.html#constructing-massive-models (there is also a short sketch right after this list).
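As a rough illustration of the massive-models point (a sketch with placeholder names; the authoritative recipe is in the doc linked above): under ZeRO-3 the integration can instantiate the model already partitioned, but only if the TrainingArguments (and thus the deepspeed config) exist before from_pretrained is called:

```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Create TrainingArguments with the deepspeed config FIRST, so that
# from_pretrained knows ZeRO-3 is enabled and doesn't have to materialize
# the full fp32 model on every process.
training_args = TrainingArguments(
    output_dir="/tmp/zero3",           # placeholder
    deepspeed="ds_config_zero3.json",  # placeholder path to the config above
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # placeholder model
trainer = Trainer(model=model, args=training_args)         # plus datasets, etc.
```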
Everything else should work as before or better.
The docs were revamped a lot too - if you find anything unclear or lacking please let me know.
If you encounter any problems please post an Issue and tag @stas00 on it.
Thank you!
Top GitHub Comments
@thies1006, there is now a PR for the assert:
@thies1006, thanks for reporting this issue. As @stas00 suggested, could you please report this as a deepspeed issue? It would be great if you included the exact ds_config.json in the issue report. Thanks so much!