
Add explicit gradient_accumulation_dtype config

See original GitHub issue

DeepSpeed has support for several dtypes now (i.e., fp32, fp16, bf16). However, it's becoming less clear which parts of training use which dtypes, and at what time. For example, in #1801 we added support for BF16 training + FP32 gradient accumulation and optimizer state sharding (ZeRO stage 1) when pipeline parallelism is enabled. This is only triggered if your config matches the following scenario:

Note: PP is enabled on the client side and not in the ds_config, but whether or not it is used also determines which code paths are supported.
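For context, here is a minimal sketch of how pipeline parallelism is typically enabled on the client side via DeepSpeed's PipelineModule (the layers, stage count, and config path below are illustrative, not taken from the issue):

# Client-side pipeline parallelism sketch: PP is created in code, not in ds_config.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()  # PP needs the process group before the module is built

layers = [nn.Linear(1024, 1024) for _ in range(8)]   # placeholder layers
model = PipelineModule(layers=layers, num_stages=2)  # this call is what enables PP

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",  # carries the "bf16" / "zero_optimization" sections shown below
)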

# pipeline-parallelism: enabled
"bf16": {
   "enabled": true
},
"zero_optimization": {
    "stage": 0
}

-> BF16 training + FP32 gradient accumulation + ZeRO stage 1 optimizer sharding via deepspeed/runtime/bf16_optimizer.py

# pipeline-parallelism: enabled
"bf16": {
   "enabled": true
},
"zero_optimization": {
    "stage": 1
}

-> BF16 training + BF16 gradient accumulation + ZeRO stage 1 optimizer sharding via deepspeed/runtime/zero/stage_1_and_2.py

The proposal is to introduce a config like the following:

"bf16": {
   "enabled": true
},
"gradient_accumulation_dtype": "fp32",
"zero_optimization": {
    "stage": 1
}

-> dispatch to the matching ZeRO implementation based on gradient_accumulation_dtype

The proposal is to add a new option to the ds_config: gradient_accumulation_dtype. We would then dispatch to the right version of ZeRO depending on which mode the user selects, making it more explicit what is happening.
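As an illustration only (the helper name, return values, and default dtype choices here are assumptions, not the actual DeepSpeed API), the dispatch could look roughly like this:

def select_optimizer_path(ds_config, pipeline_parallel):
    # Hypothetical dispatch sketch keyed on the proposed gradient_accumulation_dtype option.
    # Returning a file path is just a stand-in for constructing the matching optimizer wrapper.
    bf16 = ds_config.get("bf16", {}).get("enabled", False)
    grad_accum_dtype = ds_config.get("gradient_accumulation_dtype", "bf16" if bf16 else "fp32")
    stage = ds_config.get("zero_optimization", {}).get("stage", 0)

    if bf16 and grad_accum_dtype == "fp32":
        # FP32 gradient accumulation is only wired up for PP + ZeRO stage 1 today.
        if pipeline_parallel and stage == 1:
            return "deepspeed/runtime/bf16_optimizer.py"
        raise NotImplementedError("bf16 + fp32 accumulation currently requires PP and ZeRO stage 1")

    if bf16 and grad_accum_dtype == "bf16":
        # Gradients accumulate in bf16 through the regular ZeRO path.
        return "deepspeed/runtime/zero/stage_1_and_2.py"

    raise NotImplementedError(
        f"unhandled combination: bf16={bf16}, grad_accum_dtype={grad_accum_dtype}, stage={stage}"
    )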

I've started a table to try to express all of these possible cases and which ones would or would not be supported. It feels a bit overly complicated in some ways, however. It also doesn't consider cases where ZeRO is disabled ("stage": 0). A code sketch encoding the same matrix follows the table.

| bf16 | fp16 | grad-accu-dtype | PP | ZeRO (1,2,3) | Result       | ZeRO implementation |
|------|------|-----------------|----|--------------|--------------|---------------------|
| T    | T    | *               | *  | *            | Error        |                     |
| T    | F    | fp16            | *  | *            | NotSupported |                     |
| T    | F    | bf16            | *  | *            | OKAY         | stage_1_and_2.py    |
| T    | F    | fp32            | T  | 1            | OKAY         | bf16_optimizer.py   |
| T    | F    | fp32            | F  | 1            | NotSupported |                     |
| T    | F    | fp32            | *  | 2 or 3       | NotSupported |                     |
| F    | T    | fp16            | *  | *            | OKAY         | stage_1_and_2.py    |
| F    | T    | bf16 or fp32    | *  | *            | NotSupported |                     |
| F    | F    | fp32            | *  | *            | OKAY         | stage_1_and_2.py    |
| F    | F    | bf16 or fp16    | *  | *            | NotSupported |                     |
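Here is a sketch of how that matrix could be encoded as data and checked in one place (names are illustrative, not DeepSpeed code); a "*" cell matches anything and a tuple lists the allowed alternatives:

# The support matrix above as data: each row mirrors one line of the table.
# Columns: bf16, fp16, grad-accu-dtype, PP, ZeRO stage, result, implementation.
SUPPORT_MATRIX = [
    (True,  True,  "*",              "*",   "*",    "Error",        None),
    (True,  False, "fp16",           "*",   "*",    "NotSupported", None),
    (True,  False, "bf16",           "*",   "*",    "OKAY",         "stage_1_and_2.py"),
    (True,  False, "fp32",           True,  1,      "OKAY",         "bf16_optimizer.py"),
    (True,  False, "fp32",           False, 1,      "NotSupported", None),
    (True,  False, "fp32",           "*",   (2, 3), "NotSupported", None),
    (False, True,  "fp16",           "*",   "*",    "OKAY",         "stage_1_and_2.py"),
    (False, True,  ("bf16", "fp32"), "*",   "*",    "NotSupported", None),
    (False, False, "fp32",           "*",   "*",    "OKAY",         "stage_1_and_2.py"),
    (False, False, ("bf16", "fp16"), "*",   "*",    "NotSupported", None),
]

def _matches(cell, value):
    # "*" matches anything; a tuple lists alternatives; otherwise require equality.
    return cell == "*" or (isinstance(cell, tuple) and value in cell) or cell == value

def lookup(bf16, fp16, grad_accum_dtype, pp, zero_stage):
    # Return (result, implementation) for the first matching row, or NotSupported.
    for row in SUPPORT_MATRIX:
        if all(_matches(c, v) for c, v in zip(row[:5], (bf16, fp16, grad_accum_dtype, pp, zero_stage))):
            return row[5], row[6]
    return "NotSupported", None

For example, lookup(True, False, "fp32", True, 1) returns ("OKAY", "bf16_optimizer.py"), matching the pipeline-parallel BF16 row of the table.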

Note: this is a WIP but I don’t want to lose our progress on this discussion.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 3
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

2 reactions
tjruwase commented, Dec 1, 2022

@stas00, @assij we have added initial support covering the combinations in the table. We would appreciate help testing the combinations that matter to you.

0 reactions
stas00 commented, Jul 6, 2022

BF16Optimizer is effectively ZeRO stage 1, but currently it's a bit of a hack and thus uses stage=0; it's just implemented differently, so it can't be used as a normal stage 1. This is because there is already a bf16/stage-1 path, which is a different beast: it accumulates gradients in bf16, which you don't want, since that won't be very precise and the training won't be as smooth.

Of course, it’d be great to find an intuitive solution here. But do not worry and use stage=0 here for now.

Read more comments on GitHub