Add explicit gradient_accumulation_dtype config
DeepSpeed now supports several dtypes (fp32, fp16, bf16). However, it is becoming less clear which parts of training use which dtypes at what time. For example, in #1801 we added support for BF16 training + FP32 gradient accumulation and optimizer state sharding (ZeRO stage 1) when pipeline parallelism is enabled. This is only triggered if your config matches the following scenario:
Note: PP is enabled on the client side and not in the ds_config (see the sketch after the two scenarios below), but whether or not it is used also determines which code paths are supported.
```json
# pipeline-parallelism: enabled
"bf16": {
  "enabled": true
},
"zero_optimization": {
  "stage": 0
}
```
→ BF16 training + FP32 gradient accumulation + ZeRO stage 1 optimizer sharding via `deepspeed/runtime/bf16_optimizer.py`
```json
# pipeline-parallelism: enabled
"bf16": {
  "enabled": true
},
"zero_optimization": {
  "stage": 1
}
```
→ BF16 training + BF16 gradient accumulation + ZeRO stage 1 optimizer sharding via `deepspeed/runtime/zero/stage_1_and_2.py`
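For reference, the pipeline-parallelism part of these scenarios lives in client code rather than in the ds_config. Below is a minimal sketch of the second scenario, intended to run under the deepspeed launcher; the layer sizes, batch size, and stage count are placeholders, not taken from the issue:

```python
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

# The ds_config only carries dtype and ZeRO settings (second scenario above);
# nothing in it says whether pipeline parallelism is used.
ds_config = {
    "train_batch_size": 32,          # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}

deepspeed.init_distributed()

# Pipeline parallelism is requested on the client side by wrapping the model
# layers in a PipelineModule before calling deepspeed.initialize().
layers = [nn.Linear(1024, 1024) for _ in range(8)]   # placeholder layers
model = PipelineModule(layers=layers, num_stages=2)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```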
The proposal is to introduce a config like the following:
"bf16": {
"enabled": true
},
"gradient_accumulation_dtype": "fp32",
"zero_optimization": {
"stage": 1
}
The proposal is to add a new option in the ds_config: `gradient_accumulation_dtype`. We would then dispatch to the right version of ZeRO depending on which mode the user selects, making it more explicit what is happening.
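As a rough illustration of that dispatch, something like the sketch below could sit in the engine's optimizer-configuration step. The function name and structure are hypothetical, not actual DeepSpeed code, and it only covers the bf16 rows of the table further down:

```python
# Hypothetical sketch of dispatching on the proposed gradient_accumulation_dtype.
# The function name and structure are illustrative, not DeepSpeed's real code.
def select_zero_implementation(bf16_enabled: bool,
                               grad_accum_dtype: str,
                               pipeline_parallel: bool,
                               zero_stage: int) -> str:
    if bf16_enabled:
        if grad_accum_dtype == "bf16":
            # BF16 training + BF16 gradient accumulation
            return "deepspeed/runtime/zero/stage_1_and_2.py"
        if grad_accum_dtype == "fp32" and pipeline_parallel and zero_stage == 1:
            # BF16 training + FP32 gradient accumulation + ZeRO-1 sharding
            return "deepspeed/runtime/bf16_optimizer.py"
        raise NotImplementedError(
            f"bf16 with grad_accum_dtype={grad_accum_dtype}, "
            f"pp={pipeline_parallel}, zero_stage={zero_stage} is not supported")
    raise NotImplementedError("only the bf16 cases are sketched here")
```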
I’ve started a table to try to express all of the possible cases and which ones would be supported and which would not. It feels a bit overly complicated in some ways, however. It also doesn’t cover cases where ZeRO is disabled (`"stage": 0`).
| bf16 | fp16 | grad-accu-dtype | PP | ZeRO stage (1, 2, 3) | Result | ZeRO implementation |
|---|---|---|---|---|---|---|
| T | T | * | * | * | Error | |
| T | F | fp16 | * | * | NotSupported | |
| T | F | bf16 | * | * | OKAY | stage_1_and_2.py |
| T | F | fp32 | T | 1 | OKAY | bf16_optimizer.py |
| T | F | fp32 | F | 1 | NotSupported | |
| T | F | fp32 | * | 2 or 3 | NotSupported | |
| F | T | fp16 | * | * | OKAY | stage_1_and_2.py |
| F | T | bf16 or fp32 | * | * | NotSupported | |
| F | F | fp32 | * | * | OKAY | stage_1_and_2.py |
| F | F | bf16 or fp16 | * | * | NotSupported | |
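One way to read the matrix is as a lookup, e.g. for writing tests against the supported combinations. The sketch below is a hypothetical encoding that simply mirrors the rows above; the `"*"` wildcard matches anything, and the names are illustrative only:

```python
# Hypothetical encoding of the support matrix above; not DeepSpeed code.
# Rows with "bf16 or fp32" / "bf16 or fp16" in the table are split in two.
SUPPORT_MATRIX = [
    # (bf16,  fp16,  grad_accum_dtype, pp,    zero_stages, result,         implementation)
    (True,  True,  "*",    "*",   "*",     "Error",        None),
    (True,  False, "fp16", "*",   "*",     "NotSupported", None),
    (True,  False, "bf16", "*",   "*",     "OKAY",         "stage_1_and_2.py"),
    (True,  False, "fp32", True,  (1,),    "OKAY",         "bf16_optimizer.py"),
    (True,  False, "fp32", False, (1,),    "NotSupported", None),
    (True,  False, "fp32", "*",   (2, 3),  "NotSupported", None),
    (False, True,  "fp16", "*",   "*",     "OKAY",         "stage_1_and_2.py"),
    (False, True,  "bf16", "*",   "*",     "NotSupported", None),
    (False, True,  "fp32", "*",   "*",     "NotSupported", None),
    (False, False, "fp32", "*",   "*",     "OKAY",         "stage_1_and_2.py"),
    (False, False, "bf16", "*",   "*",     "NotSupported", None),
    (False, False, "fp16", "*",   "*",     "NotSupported", None),
]

def lookup(bf16, fp16, grad_accum_dtype, pp, zero_stage):
    """Return (result, implementation) for a configuration, per the table."""
    for r_bf16, r_fp16, r_dtype, r_pp, r_stages, result, impl in SUPPORT_MATRIX:
        if (r_bf16, r_fp16) != (bf16, fp16):
            continue
        if r_dtype != "*" and r_dtype != grad_accum_dtype:
            continue
        if r_pp != "*" and r_pp != pp:
            continue
        if r_stages != "*" and zero_stage not in r_stages:
            continue
        return result, impl
    return "NotSupported", None
```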
Note: this is a WIP but I don’t want to lose our progress on this discussion.
Top GitHub Comments
@stas00, @assij, we have added initial support covering the combinations in the table. We would appreciate help testing the combinations that matter to you.
BF16Optimizer is effectively ZeRO stage 1, but currently it’s a bit of a hack and therefore uses stage=0; it is just implemented differently, so it can’t be used as a normal stage 1. This is because there is already a bf16/stage-1 path, which is a different beast: it accumulates gradients in bf16, which you don’t want, since it won’t be very precise and the training won’t be as smooth.
Of course, it would be great to find an intuitive solution here, but for now do not worry and just use stage=0.
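Concretely, that workaround corresponds to the first scenario at the top of the issue; a sketch of the ds_config (batch size is a placeholder) is:

```python
# Workaround from the comment above: with bf16 enabled, "stage": 0, and
# pipeline parallelism on the client side, training routes through
# deepspeed/runtime/bf16_optimizer.py (FP32 gradient accumulation plus
# ZeRO-1-style optimizer state sharding).
ds_config = {
    "train_batch_size": 32,   # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 0},
}
```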