[DeepSpeed] ZeRO-Infinity integration: getting started and issues


DeepSpeed ZeRO-Infinity HF Integration is now available in the master branch of transformers. Here is a quick getting started/what’s new post.

ZeRO-Infinity extends ZeRO-3 by complementing CPU Offload with NVMe Offload, enabling the training of even bigger models. It also adds various other optimizations and improvements.

Getting started

Install the latest deepspeed version:

pip install git+https://github.com/microsoft/DeepSpeed

You will want to be on the transformers master branch. If you want to run a quick test:


git clone https://github.com/huggingface/transformers
cd transformers
BS=4; PYTHONPATH=src USE_TF=0 deepspeed examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --output_dir /tmp/zero3 --overwrite_output_dir --max_train_samples 64 \
--max_eval_samples 64 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 \
--do_train --num_train_epochs 1 --per_device_train_batch_size $BS --per_device_eval_batch_size $BS \
--learning_rate 3e-3 --warmup_steps 500 --predict_with_generate --logging_steps 0 --save_steps 0 \
--eval_steps 1 --group_by_length   --dataset_name wmt16 --dataset_config ro-en --source_lang en \
--target_lang ro --source_prefix "translate English to Romanian: " \
--deepspeed tests/deepspeed/ds_config_zero3.json

You will find very detailed documentation here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed

Your new config file will look like this (for ZeRO-3 as an example):

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e14,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

If you want to experiment with NVMe offload, please see: https://huggingface.co/transformers/master/main_classes/trainer.html#nvme-support
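For reference, switching the config above to NVMe offload mainly means pointing the two offload sub-sections inside zero_optimization at an NVMe device and adding a top-level aio section. This is only a minimal sketch: /local_nvme is a placeholder for wherever your NVMe drive is mounted, and the aio values are just common starting points, so please treat the linked documentation as the authoritative reference.

    "offload_optimizer": {
        "device": "nvme",
        "nvme_path": "/local_nvme",
        "pin_memory": true
    },
    "offload_param": {
        "device": "nvme",
        "nvme_path": "/local_nvme",
        "pin_memory": true
    },

    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": false,
        "overlap_events": true
    }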

DeepSpeed currently runs only fp16 mixed precision

While the DeepSpeed devs are working on an fp32 mode, at this moment only fp16/AMP-like train/eval is available. So if your model struggles under fp16/AMP, it will have the same struggles under DeepSpeed.

Moreover, because DeepSpeed does model.half(), forcing all weights to fp16, some models might not be ready for this (under AMP things are switched to fp16 dynamically, only where needed). If you run into this, please post a new issue and we will try to find a solution/workaround for those special cases.

Must use the latest transformers master

If you get DeepSpeed errors such as it not knowing what the auto value is, you aren't on the latest transformers master branch: git pull if you already have a clone, and if you installed it, update your install.

For those who already use DeepSpeed HF integration

As the integration is evolving, it has gone through a major revamp and various improvements.

There are 2 important changes that you need to be aware of if you’re already using DeepSpeed integration in transformers:

  1. After this release, only config params that are set to auto will be automatically overridden/set to the correct/recommended values; everything else is left as is (see the small example after this list). This is to avoid the previously confusing behavior of never being quite sure what gets overridden and what doesn't, despite the logger reporting what it did override. The new behavior is completely unambiguous.


    Full doc: https://huggingface.co/transformers/master/main_classes/trainer.html#shared-configuration

  2. If you are using massive models and aren’t using example scripts, make sure to read:

    Full doc: https://huggingface.co/transformers/master/main_classes/trainer.html#constructing-massive-models
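To make point 1 concrete, here is a hypothetical optimizer section: the entries set to auto will be filled in from the Trainer command line arguments, while the explicitly set weight_decay of 0.01 is left untouched (it then becomes your responsibility to keep such values consistent with the Trainer arguments).

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": 0.01
        }
    }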

Everything else should work as before or better.

The docs were revamped a lot too; if you find anything unclear or lacking, please let me know.

If you encounter any problems, please post an issue and tag @stas00 in it.

Thank you!

Top GitHub Comments

2 reactions
tjruwase commented, May 3, 2021

@thies1006, there is now a PR for the assert:

1 reaction
tjruwase commented, May 2, 2021

@thies1006, thanks for reporting this issue. As @stas00 suggested, could you please report this as a DeepSpeed issue? It would be great if you included the exact ds_config.json in the issue report. Thanks so much!
