Setting accumulate_grad_batches (accumulate_grad_steps) > 1 with the DeepSpeed plugin and CPU offload leads to incorrect model training
🐛 Bug
-
Setting accumulate_grad_batches (accumulate_grad_steps) > 1 with the DeepSpeed plugin leads to extremely slow updates of the model params. The reason is a mismatch between ds.accumulate_grad_steps (DeepSpeed) and pl.accumulate_grad_batches (PyTorch Lightning). It is not possible to specify ds.accumulate_grad_steps yourself (via the config file or hparams), because pl.DeepSpeedPlugin forcibly sets ds.accumulate_grad_steps equal to pl.accumulate_grad_batches. However, pl.Trainer and ds.engine each count accumulation steps independently. For example, if I set pl.accumulate_grad_batches = 64, then ds.accumulate_grad_steps is automatically set to 64 as well. pl.Trainer only triggers ds.engine.step() after 64 batches, and each such call increments ds.engine.micro_steps by 1, so it takes 64 * 64 trainer steps before ds.engine lets the optimizer step once and update the model params. In addition, the loss passed to backward is scaled twice in this situation (by 64 * 64 in total). The toy simulation after the two excerpts below makes the compounding concrete.
Code in `deepspeed/runtime/engine.py`:

```python
class DeepSpeedEngine:
    ...
    def is_gradient_accumulation_boundary(self):
        return (self.micro_steps + 1) % self.gradient_accumulation_steps() == 0

    ...

    def step(self, lr_kwargs=None):
        ...
        # Update the model when we reach gradient accumulation boundaries
        if self.is_gradient_accumulation_boundary():
            if self.progressive_layer_drop:
                self.progressive_layer_drop.update_state(self.global_steps)
            self._take_model_step(lr_kwargs)
        ...
        self.micro_steps += 1
```
Code in `pytorch_lightning/trainer/training_loop.py`:

```python
class TrainLoop:
    ...
    def should_accumulate(self):
        # checks if backward or backward + optimizer step (via closure)
        accumulation_done = self._accumulated_batches_reached()
        is_final_batch = self._num_training_batches_reached()
        return not (accumulation_done or is_final_batch)

    ...

    def run_training_batch(...):
        ...
        if self.should_accumulate():
            ...  # backward only
        else:
            ...
            # actually calls ds.engine.step() when using DeepSpeedPlugin
            self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
```
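To see how the two counters compound, here is a minimal, self-contained simulation. It is not Lightning or DeepSpeed code; `pl_accumulate`, `ds_accumulate`, `micro_steps` and `optimizer_steps` are illustrative toy counters that only mirror the logic of the two excerpts above.

```python
# Toy simulation of the two independent accumulation counters (illustrative only).
pl_accumulate = 64   # pl.accumulate_grad_batches
ds_accumulate = 64   # ds.accumulate_grad_steps (forced equal by the plugin)

micro_steps = 0      # incremented once per ds.engine.step() call
optimizer_steps = 0  # how often the real optimizer actually runs

total_batches = 10_000
for batch_idx in range(1, total_batches + 1):
    # pl.Trainer only calls ds.engine.step() once every pl_accumulate batches
    if batch_idx % pl_accumulate == 0:
        micro_steps += 1
        # ds.engine only steps the optimizer on its own accumulation boundary
        if micro_steps % ds_accumulate == 0:
            optimizer_steps += 1

print(f"batches seen:      {total_batches}")    # 10000
print(f"engine.step calls: {micro_steps}")      # 156
print(f"optimizer steps:   {optimizer_steps}")  # 2 -> one update per 64 * 64 = 4096 batches
```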
-
For the sake of training with DeepSpeed ZeRO-Offload, I tried setting ds.accumulate_grad_steps = 1 manually before ds.DeepSpeedEngine.initialize() in `pytorch_lightning/plugins/training_type/deepspeed.py` (while keeping pl.accumulate_grad_batches = 64). It seems to work at first, and the loss starts to decrease a little faster (64 steps per parameter update). However, it is weird that the loss still decreases slowly after the warmup steps. Looking into the ZeRO-Offload and DeepSpeed implementation, I found there is still a mistake in this hparams setting. The model does a forward and backward pass on a mini batch at every single step, and with CPU offload turned on the computed gradients are moved to the CPU (i.e. into host memory). But because ds.accumulate_grad_steps = 1 means every training step hits the accumulation boundary (with allreduce=True), DeepSpeed overwrites the buffer holding the gradients in `async_inplace_copy_grad_to_fp32_buffer_from_gpu(param)`. After each full training batch (64 micro-steps), pl.Trainer triggers ds.engine.step() once, which calls optimizer.step() and writes the only partially accumulated update back to the model. A toy illustration follows the excerpt.

Code in `deepspeed/runtime/zero/stage2.py`:
```python
def copy_grads_in_partition(self, param):
    if self.cpu_offload:
        if self.gradient_accumulation_steps > 1:
            self.async_accumulate_grad_in_cpu_via_gpu(param)

        if self.is_gradient_accumulation_boundary:
            self.set_norm_for_param_grad_in_gpu(param)
            self.update_overflow_tracker_for_param_grad(param)
            self.async_inplace_copy_grad_to_fp32_buffer_from_gpu(param)

        return
    ...
```
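Here is a scalar toy version of that code path. It is not DeepSpeed's real buffer management; `micro_grads`, `fp32_buffer` and `cpu_accum` are made-up stand-ins that only mirror the control flow of the excerpt above, showing why ds.accumulate_grad_steps = 1 keeps only the last micro-batch gradient while a value > 1 would accumulate the full sum before the boundary copy.

```python
# Scalar stand-ins for the gradient buffers (illustrative only).
micro_grads = [1.0, 2.0, 3.0, 4.0]     # gradients of four micro-batches of one "big" batch

# Case 1: ds.accumulate_grad_steps == 1 -> every micro-step is a "boundary",
# so the fp32 buffer is overwritten in place each time.
fp32_buffer = 0.0
for g in micro_grads:
    fp32_buffer = g                    # async_inplace_copy_grad_to_fp32_buffer_from_gpu
print(fp32_buffer)                     # 4.0 -> only the last micro-batch survives
                                       #        when ds.engine.step() finally runs

# Case 2: ds.accumulate_grad_steps > 1 -> gradients are accumulated first and
# only the boundary micro-step copies the full sum into the fp32 buffer.
cpu_accum = 0.0
for i, g in enumerate(micro_grads):
    cpu_accum += g                     # async_accumulate_grad_in_cpu_via_gpu
    if i == len(micro_grads) - 1:      # is_gradient_accumulation_boundary
        fp32_buffer = cpu_accum
print(fp32_buffer)                     # 10.0 -> the expected accumulated gradient
```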
-
Finally, I set pl.accumulate_grad_batches = 1 and ds.accumulate_grad_steps = 64. Although some problems remain (e.g. the pl.Trainer progress bar display and a mismatch in learning-rate scheduler steps), the model starts to train normally. The progress bar issue is not a big deal, but the lr_scheduler stepping does slightly affect training. The problem is similar to the previous one: the Trainer triggers lr_scheduler.step() once after every training batch (1 step), but ds.engine's optimizer only updates the params later (after 64 steps). As a result, the accumulated gradient ends up being a sum of gradients weighted by different learning rates. Thus, I overrode optimizer_step() and counted the lr_scheduler steps manually (and did not return the lr_scheduler from configure_optimizers()). The overridden optimizer_step() does not hit the exception raised by pytorch_lightning==1.2.0 (which is only raised when accumulate_grad_batches > 1), because pl.accumulate_grad_batches == 1, and the model works fine.
Code in `./MYCODE.py`:

```python
class MyPLModule(pl.LightningModule):
    ...
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu, using_native_amp, using_lbfgs):
        if self.trainer.use_tpu or on_tpu:
            xm.optimizer_step(optimizer)
        else:
            optimizer.step(closure=optimizer_closure)
        optimizer.zero_grad()
        # step the lr scheduler manually, only once every 64 micro-steps,
        # to match DeepSpeed's accumulation boundary
        if self.mStep % 64 == 0:
            self.lr_scheduler.step()
        self.mStep += 1
```
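For completeness, here is a rough end-to-end sketch of how the final workaround could be wired together, written against the 1.2-era API as I understand it. The `torch.nn.Linear` placeholder model, the `mStep` initial value, the `get_linear_schedule_with_warmup` scheduler and the `ds_config.json` filename are my own illustrative choices, and `DeepSpeedPlugin(config=...)` is assumed to accept a path to the JSON config shown in the hparams section below.

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin
from transformers import get_linear_schedule_with_warmup

class MyPLModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(8, 2)  # placeholder model for the sketch
        self.mStep = 1                      # manual micro-step counter used by optimizer_step()

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=3e-4)
        # Keep the scheduler as a plain attribute instead of returning it,
        # so pl.Trainer never steps it on its own per-batch schedule.
        self.lr_scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=1000, num_training_steps=100_000)
        return optimizer

    # ... training_step(), dataloaders and the overridden optimizer_step()
    # from the snippet above go here ...

trainer = pl.Trainer(
    gpus=1,
    precision=16,
    accumulate_grad_batches=1,  # let DeepSpeed do all of the accumulation
    plugins=[DeepSpeedPlugin(config="ds_config.json")],  # config carries the 64 steps
)
```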
Environment
- CUDA:
    - GPU:
        - GeForce GTX 1080
    - available: True
    - version: 11.1
- Packages:
    - numpy: 1.19.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.0a0+186c3da
    - pytorch-lightning: 1.2.0
    - tqdm: 4.50.2
    - transformers: 4.3.2
    - deepspeed: 0.3.10
- System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.8.5
    - version: #136-Ubuntu SMP Tue Jan 12 14:58:42 UTC 2021
- hparams
- deepspeed config
{ "fp16": { "enabled": true }, // "accumulate_grad_steps": 64, // setting in pytorch_lightning/plugins/training_type/deepspeed.py forcely "train_micro_batch_size_per_gpu": 1, "gradient_clipping": 1.0, "zero_allow_untested_optimizer": true, "zero_optimization": { "stage": 2, "allgather_partitions": true, "allgather_bucket_size": 2e8, "reduce_scatter": true, "reduce_bucket_size": 2e8, "overlap_comm": true, "contiguous_gradients": true, "cpu_offload": true } }
- pytorch_lightning trainer config
```python
args_dict = dict(
    data_dir="",
    output_dir="",
    model_name_or_path='google/mt5-small',
    tokenizer_name_or_path='google/mt5-small',
    max_seq_length=100,
    learning_rate=3e-4,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=1000,
    train_batch_size=1,
    eval_batch_size=1,
    num_train_epochs=1,
    gradient_accumulation_steps=64,  # finally set to 1
    n_gpu=1,
    fp_16=True,
    opt_level='O1',
    max_grad_norm=1.0,
)
```
@tchaton Thanks for the invitation. I am very glad to take part in the Lightning contributor team. 😃
Thanks @hwade 😃