[DeepSpeed] ZeRO-Infinity integration: getting started and issues
DeepSpeed ZeRO-Infinity HF Integration is now available in the master branch of transformers. Here is a quick getting-started / what's-new post.
ZeRO-Infinity extends ZeRO-3 by adding NVMe offload on top of CPU offload, enabling training of even bigger models, and it adds various other optimizations and improvements.
Getting started
Install the latest deepspeed version:

```bash
pip install git+https://github.com/microsoft/DeepSpeed
```
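Optionally, you can sanity-check the install with DeepSpeed's own environment report (the ds_report command ships with the package):

```bash
# show the DeepSpeed version, torch/cuda setup and which ops are pre-built
ds_report
python -c "import deepspeed; print(deepspeed.__version__)"
```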
If you want to run a quick test, you will want to be on the transformers master branch:

```bash
git clone https://github.com/huggingface/transformers
cd transformers
BS=4; PYTHONPATH=src USE_TF=0 deepspeed examples/pytorch/translation/run_translation.py \
--model_name_or_path t5-small --output_dir /tmp/zero3 --overwrite_output_dir --max_train_samples 64 \
--max_eval_samples 64 --max_source_length 128 --max_target_length 128 --val_max_target_length 128 \
--do_train --num_train_epochs 1 --per_device_train_batch_size $BS --per_device_eval_batch_size $BS \
--learning_rate 3e-3 --warmup_steps 500 --predict_with_generate --logging_steps 0 --save_steps 0 \
--eval_steps 1 --group_by_length --dataset_name wmt16 --dataset_config ro-en --source_lang en \
--target_lang ro --source_prefix "translate English to Romanian: " \
--deepspeed tests/deepspeed/ds_config_zero3.json
```
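The deepspeed launcher will use all visible GPUs by default; for a quick smoke test on a single GPU you can restrict it with the launcher's --num_gpus flag (same command as above, elided here):

```bash
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py ... \
--deepspeed tests/deepspeed/ds_config_zero3.json
```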
You will find very detailed documentation here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed
Your new config file will look like this (for ZeRO-3 as an example):
```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e14,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```
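Every auto entry is resolved by the HF Trainer from the matching command line argument / TrainingArguments field (learning rate, warmup, batch size, fp16, etc.), so the two can never go out of sync. If you drive the Trainer from your own script rather than the example above, here is a minimal sketch (paths and hyperparameters are placeholders) of passing the config via the deepspeed argument of TrainingArguments:

```python
from transformers import TrainingArguments

# values that map to "auto" entries in the ds_config are taken from here
training_args = TrainingArguments(
    output_dir="/tmp/zero3",                 # placeholder
    per_device_train_batch_size=4,
    learning_rate=3e-3,
    warmup_steps=500,
    fp16=True,
    deepspeed="tests/deepspeed/ds_config_zero3.json",  # the config shown above
)
```

Launch such a script with the deepspeed launcher, as in the quick-test command above.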
If you want to experiment with NVMe offload, please see: https://huggingface.co/transformers/master/main_classes/trainer.html#nvme-support
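For example, to offload to NVMe instead of CPU, the offload sections of the ZeRO-3 config above change roughly like this (a sketch; /local_nvme is a placeholder path to a fast local NVMe mount, and further tuning knobs are described in the doc linked above):

```json
"offload_optimizer": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": true
},
"offload_param": {
    "device": "nvme",
    "nvme_path": "/local_nvme",
    "pin_memory": true
}
```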
DeepSpeed currently runs only in fp16 mixed precision
While the deepspeed devs are working on an fp32 mode, at this moment only fp16/amp-like train/eval is available. So if your model struggles under fp16/amp, it will have the same struggles under deepspeed.
Moreover, because deepspeed does model.half(), forcing all weights to fp16, some models might not be ready for this (under AMP things are switched to fp16 dynamically only where needed). If you run into this, please post a new issue and we will try to find a solution/workaround for those special cases.
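If you are unsure whether your model is fp16-friendly, a rough standalone check (a hypothetical snippet, not part of transformers; t5-small is just an example) is to run a forward pass on a .half() copy and look for inf/nan in the logits:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").half().cuda()
tokenizer = AutoTokenizer.from_pretrained("t5-small")

batch = tokenizer("translate English to Romanian: hello", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**batch, decoder_input_ids=batch["input_ids"])

# overflows here hint that the model may struggle under fp16, and therefore under deepspeed
print("inf:", torch.isinf(out.logits).any().item(), "nan:", torch.isnan(out.logits).any().item())
```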
Must use the latest transformers master

If you get deepspeed errors like it not knowing what the auto value is, you aren't on the latest transformers master branch: git pull if you already have a clone, and if you installed it, update your install.
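For example, assuming you are using the clone from the quick test above and installed it from source, updating would look roughly like this:

```bash
cd transformers
git pull
pip install -e .   # refresh the install after pulling
```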
For those who already use DeepSpeed HF integration
As the integration is evolving, it has gone through a major revamp and received various improvements.
There are 2 important changes that you need to be aware of if you're already using the DeepSpeed integration in transformers:
- After this release, only config params that are set to auto will get automatically overridden/set to the correct/recommended values; everything else is left as is. This is to avoid the previously confusing behavior of never being quite sure what got overridden and what didn't, despite the logger reporting what it did override. The new behavior is completely unambiguous. See the examples in the full doc: https://huggingface.co/transformers/master/main_classes/trainer.html#shared-configuration
- If you are using massive models and aren't using the example scripts, make sure to read the full doc: https://huggingface.co/transformers/master/main_classes/trainer.html#constructing-massive-models (there is also a short sketch right after this list).
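As a rough illustration of the massive-models point (a sketch with placeholder names; the authoritative recipe is in the doc linked above): under ZeRO-3 the integration can instantiate the model already partitioned, but only if the TrainingArguments (and thus the deepspeed config) exist before from_pretrained is called:

```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Create TrainingArguments with the deepspeed config FIRST, so that
# from_pretrained knows ZeRO-3 is enabled and doesn't have to materialize
# the full fp32 model on every process.
training_args = TrainingArguments(
    output_dir="/tmp/zero3",           # placeholder
    deepspeed="ds_config_zero3.json",  # placeholder path to the config above
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # placeholder model
trainer = Trainer(model=model, args=training_args)         # plus datasets, etc.
```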
Everything else should work as before or better.
The docs were revamped a lot too - if you find anything unclear or lacking please let me know.
If you encounter any problems please post an Issue and tag @stas00 on it.
Thank you!
Top GitHub Comments
@thies1006, there is now a PR for the assert:
@thies1006, thanks for reporting this issue. As @stas00 suggested, could you please report this as a deepspeed issue? It would be great if you included the exact ds_config.json in the issue report. Thanks so much!