
Add explicit gradient_accumulation_dtype config

See original GitHub issue

DeepSpeed has support for several dtypes now (i.e., fp32, fp16, bf16). However, it's becoming less clear which parts of training use which dtypes, and at what time. For example, in #1801 we added support for BF16 training + FP32 gradient accumulation and optimizer state sharding (ZeRO stage 1) when pipeline parallelism is enabled. This is only triggered if your config matches the following scenario:

Note: PP is enabled on the client side and not in the ds_config, but whether or not it is used also determines which code paths are supported.
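For context, here is a minimal sketch of how pipeline parallelism is typically enabled on the client side via DeepSpeed's PipelineModule (the layers, stage count, and config path below are illustrative, not taken from the issue):

# Client-side pipeline parallelism sketch: PP is created in code, not in ds_config.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()  # PP needs the process group before the module is built

layers = [nn.Linear(1024, 1024) for _ in range(8)]   # placeholder layers
model = PipelineModule(layers=layers, num_stages=2)  # this call is what enables PP

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # carries the "bf16" / "zero_optimization" sections shown below
)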

# pipeline-parallelism: enabled
"bf16": {
   "enabled": true
},
"zero_optimization": {
    "stage": 0
}

-> BF16 training + FP32 gradient accumulation + ZeRO stage 1 optimizer sharding via deepspeed/runtime/bf16_optimizer.py

# pipeline-parallelism: enabled
"bf16": {
   "enabled": true
},
"zero_optimization": {
    "stage": 1
}

-> BF16 training + BF16 gradient accumulation + ZeRO stage 1 optimizer sharding via deepspeed/runtime/zero/stage_1_and_2.py

The proposal is to introduce a config like the following:

"bf16": {
   "enabled": true
},
"gradient_accumulation_dtype": "fp32",
"zero_optimization": {
    "stage": 1
}

-> dispatch to the matching ZeRO implementation based on gradient_accumulation_dtype

The proposal is to add a new option to the ds_config: gradient_accumulation_dtype. We would then dispatch to the right version of ZeRO depending on which mode the user selects, making it more explicit what is happening.
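As an illustration only (the helper name, return values, and default dtype choices here are assumptions, not the actual DeepSpeed API), the dispatch could look roughly like this:

def select_optimizer_path(ds_config, pipeline_parallel):
    # Hypothetical dispatch sketch keyed on the proposed gradient_accumulation_dtype option.
    # Returning a file path is just a stand-in for constructing the matching optimizer wrapper.
    bf16 = ds_config.get("bf16", {}).get("enabled", False)
    grad_accum_dtype = ds_config.get("gradient_accumulation_dtype", "bf16" if bf16 else "fp32")
    stage = ds_config.get("zero_optimization", {}).get("stage", 0)

    if bf16 and grad_accum_dtype == "fp32":
        # FP32 gradient accumulation is only wired up for PP + ZeRO stage 1 today.
        if pipeline_parallel and stage == 1:
            return "deepspeed/runtime/bf16_optimizer.py"
        raise NotImplementedError("bf16 + fp32 accumulation currently requires PP and ZeRO stage 1")

    if bf16 and grad_accum_dtype == "bf16":
        # Gradients accumulate in bf16 through the regular ZeRO path.
        return "deepspeed/runtime/zero/stage_1_and_2.py"

    raise NotImplementedError(
        f"unhandled combination: bf16={bf16}, grad_accum_dtype={grad_accum_dtype}, stage={stage}"
    )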

I've started a table to try to express all of these possible cases and which ones would or would not be supported. It feels a bit overly complicated in some ways, however. It also doesn't consider cases where ZeRO is disabled ("stage": 0). A code sketch encoding the same matrix follows the table.

| bf16 | fp16 | grad-accu-dtype | PP | ZeRO (1,2,3) | Result       | ZeRO implementation |
|------|------|-----------------|----|--------------|--------------|---------------------|
| T    | T    | *               | *  | *            | Error        |                     |
| T    | F    | fp16            | *  | *            | NotSupported |                     |
| T    | F    | bf16            | *  | *            | OKAY         | stage_1_and_2.py    |
| T    | F    | fp32            | T  | 1            | OKAY         | bf16_optimizer.py   |
| T    | F    | fp32            | F  | 1            | NotSupported |                     |
| T    | F    | fp32            | *  | 2 or 3       | NotSupported |                     |
| F    | T    | fp16            | *  | *            | OKAY         | stage_1_and_2.py    |
| F    | T    | bf16 or fp32    | *  | *            | NotSupported |                     |
| F    | F    | fp32            | *  | *            | OKAY         | stage_1_and_2.py    |
| F    | F    | bf16 or fp16    | *  | *            | NotSupported |                     |
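Here is a sketch of how that matrix could be encoded as data and checked in one place (names are illustrative, not DeepSpeed code); a "*" cell matches anything and a tuple lists the allowed alternatives:

# The support matrix above as data: each row mirrors one line of the table.
# Columns: bf16, fp16, grad-accu-dtype, PP, ZeRO stage, result, implementation.
SUPPORT_MATRIX = [
    (True,  True,  "*",              "*",   "*",    "Error",        None),
    (True,  False, "fp16",           "*",   "*",    "NotSupported", None),
    (True,  False, "bf16",           "*",   "*",    "OKAY",         "stage_1_and_2.py"),
    (True,  False, "fp32",           True,  1,      "OKAY",         "bf16_optimizer.py"),
    (True,  False, "fp32",           False, 1,      "NotSupported", None),
    (True,  False, "fp32",           "*",   (2, 3), "NotSupported", None),
    (False, True,  "fp16",           "*",   "*",    "OKAY",         "stage_1_and_2.py"),
    (False, True,  ("bf16", "fp32"), "*",   "*",    "NotSupported", None),
    (False, False, "fp32",           "*",   "*",    "OKAY",         "stage_1_and_2.py"),
    (False, False, ("bf16", "fp16"), "*",   "*",    "NotSupported", None),
]

def _matches(cell, value):
    # "*" matches anything; a tuple lists alternatives; otherwise require equality.
    return cell == "*" or (isinstance(cell, tuple) and value in cell) or cell == value

def lookup(bf16, fp16, grad_accum_dtype, pp, zero_stage):
    # Return (result, implementation) for the first matching row, or NotSupported.
    for row in SUPPORT_MATRIX:
        if all(_matches(c, v) for c, v in zip(row[:5], (bf16, fp16, grad_accum_dtype, pp, zero_stage))):
            return row[5], row[6]
    return "NotSupported", None

For example, lookup(True, False, "fp32", True, 1) returns ("OKAY", "bf16_optimizer.py"), matching the pipeline-parallel BF16 row of the table.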

Note: this is a WIP but I don’t want to lose our progress on this discussion.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 3
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

2 reactions
tjruwase commented, Dec 1, 2022

@stas00, @assij we have added initial support covering the combinations in the table. We would appreciate help testing the combinations that matter to you.

0 reactions
stas00 commented, Jul 6, 2022

BF16Optimizer is effectively ZeRO stage 1, but currently it's a bit of a hack and thus uses stage=0; it's just implemented differently, so it can't be used as a normal stage 1. This is because there is already a bf16/stage-1 path, which is a different beast: it accumulates gradients in bf16, which you don't want, since that won't be very precise and the training won't be as smooth.

Of course, it’d be great to find an intuitive solution here. But do not worry and use stage=0 here for now.

Read more comments on GitHub