Setting accumulate_grad_batches (accumulate_grad_steps) > 1 with the DeepSpeed plugin and CPU offload leads to incorrect model training
🐛 Bug
-
Setting accumulate_grad_batches (accumulate_grad_steps) > 1 with the DeepSpeed plugin leads to extremely slow updates of the model params. The reason is a mismatch between ds.accumulate_grad_steps (DeepSpeed) and pl.accumulate_grad_batches (PyTorch Lightning). It is not possible to specify ds.accumulate_grad_steps yourself (via the config file or hparams), because pl.DeepSpeedPlugin forcibly sets ds.accumulate_grad_steps equal to pl.accumulate_grad_batches. However, pl.Trainer and ds.engine each count accumulation steps independently. For example, if I set pl.accumulate_grad_batches = 64, then ds.accumulate_grad_steps is automatically set to 64 as well. pl.Trainer only triggers ds.engine.step() after 64 batches, and each such call increments ds.engine.micro_steps by 1, so it takes 64 * 64 trainer steps before ds.engine lets the optimizer step once and update the model params. In addition, the loss passed to backward is scaled twice in this situation (by 64 * 64 in total). The toy simulation after the two excerpts below makes the compounding concrete.
Code in `deepspeed/runtime/engine.py`:

```python
class DeepSpeedEngine:
    ...
    def is_gradient_accumulation_boundary(self):
        return (self.micro_steps + 1) % self.gradient_accumulation_steps() == 0

    ...

    def step(self, lr_kwargs=None):
        ...
        # Update the model when we reach gradient accumulation boundaries
        if self.is_gradient_accumulation_boundary():
            if self.progressive_layer_drop:
                self.progressive_layer_drop.update_state(self.global_steps)
            self._take_model_step(lr_kwargs)
        ...
        self.micro_steps += 1
```
Code in `pytorch_lightning/trainer/training_loop.py`:

```python
class TrainLoop:
    ...
    def should_accumulate(self):
        # checks if backward or backward + optimizer step (via closure)
        accumulation_done = self._accumulated_batches_reached()
        is_final_batch = self._num_training_batches_reached()
        return not (accumulation_done or is_final_batch)

    ...

    def run_training_batch(...):
        ...
        if self.should_accumulate():
            ...  # backward only
        else:
            ...
            # actually calls ds.engine.step() when using DeepSpeedPlugin
            self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
```
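To see how the two counters compound, here is a minimal, self-contained simulation. It is not Lightning or DeepSpeed code; `pl_accumulate`, `ds_accumulate`, `micro_steps` and `optimizer_steps` are illustrative toy counters that only mirror the logic of the two excerpts above.

```python
# Toy simulation of the two independent accumulation counters (illustrative only).
pl_accumulate = 64   # pl.accumulate_grad_batches
ds_accumulate = 64   # ds.accumulate_grad_steps (forced equal by the plugin)

micro_steps = 0      # incremented once per ds.engine.step() call
optimizer_steps = 0  # how often the real optimizer actually runs

total_batches = 10_000
for batch_idx in range(1, total_batches + 1):
    # pl.Trainer only calls ds.engine.step() once every pl_accumulate batches
    if batch_idx % pl_accumulate == 0:
        micro_steps += 1
        # ds.engine only steps the optimizer on its own accumulation boundary
        if micro_steps % ds_accumulate == 0:
            optimizer_steps += 1

print(f"batches seen:      {total_batches}")    # 10000
print(f"engine.step calls: {micro_steps}")      # 156
print(f"optimizer steps:   {optimizer_steps}")  # 2 -> one update per 64 * 64 = 4096 batches
```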
-
For the sake of training with DeepSpeed ZeRO-Offload, I tried setting ds.accumulate_grad_steps = 1 manually before ds.DeepSpeedEngine.initialize() in `pytorch_lightning/plugins/training_type/deepspeed.py` (while keeping pl.accumulate_grad_batches = 64). It seems to work at first, and the loss starts to decrease a little faster (64 steps per parameter update). However, it is weird that the loss still decreases slowly after the warmup steps. Looking into the ZeRO-Offload and DeepSpeed implementation, I found there is still a mistake in this hparams setting. The model does a forward and backward pass on a mini batch at every single step, and with CPU offload turned on the computed gradients are moved to the CPU (i.e. into host memory). But because ds.accumulate_grad_steps = 1 means every training step hits the accumulation boundary (with allreduce=True), DeepSpeed overwrites the buffer holding the gradients in `async_inplace_copy_grad_to_fp32_buffer_from_gpu(param)`. After each full training batch (64 micro-steps), pl.Trainer triggers ds.engine.step() once, which calls optimizer.step() and writes the only partially accumulated update back to the model. A toy illustration follows the excerpt.

Code in `deepspeed/runtime/zero/stage2.py`:
```python
def copy_grads_in_partition(self, param):
    if self.cpu_offload:
        if self.gradient_accumulation_steps > 1:
            self.async_accumulate_grad_in_cpu_via_gpu(param)

        if self.is_gradient_accumulation_boundary:
            self.set_norm_for_param_grad_in_gpu(param)
            self.update_overflow_tracker_for_param_grad(param)
            self.async_inplace_copy_grad_to_fp32_buffer_from_gpu(param)

        return
    ...
```
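Here is a scalar toy version of that code path. It is not DeepSpeed's real buffer management; `micro_grads`, `fp32_buffer` and `cpu_accum` are made-up stand-ins that only mirror the control flow of the excerpt above, showing why ds.accumulate_grad_steps = 1 keeps only the last micro-batch gradient while a value > 1 would accumulate the full sum before the boundary copy.

```python
# Scalar stand-ins for the gradient buffers (illustrative only).
micro_grads = [1.0, 2.0, 3.0, 4.0]     # gradients of four micro-batches of one "big" batch

# Case 1: ds.accumulate_grad_steps == 1 -> every micro-step is a "boundary",
# so the fp32 buffer is overwritten in place each time.
fp32_buffer = 0.0
for g in micro_grads:
    fp32_buffer = g                    # async_inplace_copy_grad_to_fp32_buffer_from_gpu
print(fp32_buffer)                     # 4.0 -> only the last micro-batch survives
                                       #        when ds.engine.step() finally runs

# Case 2: ds.accumulate_grad_steps > 1 -> gradients are accumulated first and
# only the boundary micro-step copies the full sum into the fp32 buffer.
cpu_accum = 0.0
for i, g in enumerate(micro_grads):
    cpu_accum += g                     # async_accumulate_grad_in_cpu_via_gpu
    if i == len(micro_grads) - 1:      # is_gradient_accumulation_boundary
        fp32_buffer = cpu_accum
print(fp32_buffer)                     # 10.0 -> the expected accumulated gradient
```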
-
Finally, I set pl.accumulate_grad_batches = 1 and ds.accumulate_grad_steps = 64. Although some problems remain (e.g. the pl.Trainer progress bar display and a mismatch in learning-rate scheduler steps), the model starts to train normally. The progress bar issue is not a big deal, but the lr_scheduler stepping does slightly affect training. The problem is similar to the previous one: the Trainer triggers lr_scheduler.step() once after every training batch (1 step), but ds.engine's optimizer only updates the params later (after 64 steps). As a result, the accumulated gradient ends up being a sum of gradients weighted by different learning rates. Thus, I overrode optimizer_step() and counted the lr_scheduler steps manually (and did not return the lr_scheduler from configure_optimizers()). The overridden optimizer_step() does not hit the exception raised by pytorch_lightning==1.2.0 (which is only raised when accumulate_grad_batches > 1), because pl.accumulate_grad_batches == 1, and the model works fine.
Code in `./MYCODE.py`:

```python
class MyPLModule(pl.LightningModule):
    ...
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx,
                       optimizer_closure, on_tpu, using_native_amp, using_lbfgs):
        if self.trainer.use_tpu or on_tpu:
            xm.optimizer_step(optimizer)
        else:
            optimizer.step(closure=optimizer_closure)
        optimizer.zero_grad()
        # step the lr scheduler manually, only once every 64 micro-steps,
        # to match DeepSpeed's accumulation boundary
        if self.mStep % 64 == 0:
            self.lr_scheduler.step()
        self.mStep += 1
```
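For completeness, here is a rough end-to-end sketch of how the final workaround could be wired together, written against the 1.2-era API as I understand it. The `torch.nn.Linear` placeholder model, the `mStep` initial value, the `get_linear_schedule_with_warmup` scheduler and the `ds_config.json` filename are my own illustrative choices, and `DeepSpeedPlugin(config=...)` is assumed to accept a path to the JSON config shown in the hparams section below.

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin
from transformers import get_linear_schedule_with_warmup

class MyPLModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(8, 2)  # placeholder model for the sketch
        self.mStep = 1                      # manual micro-step counter used by optimizer_step()

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=3e-4)
        # Keep the scheduler as a plain attribute instead of returning it,
        # so pl.Trainer never steps it on its own per-batch schedule.
        self.lr_scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=1000, num_training_steps=100_000)
        return optimizer

    # ... training_step(), dataloaders and the overridden optimizer_step()
    # from the snippet above go here ...

trainer = pl.Trainer(
    gpus=1,
    precision=16,
    accumulate_grad_batches=1,  # let DeepSpeed do all of the accumulation
    plugins=[DeepSpeedPlugin(config="ds_config.json")],  # config carries the 64 steps
)
```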
Environment
- CUDA:
    - GPU:
        - GeForce GTX 1080
    - available: True
    - version: 11.1
- Packages:
    - numpy: 1.19.2
    - pyTorch_debug: False
    - pyTorch_version: 1.8.0a0+186c3da
    - pytorch-lightning: 1.2.0
    - tqdm: 4.50.2
    - transformers: 4.3.2
    - deepspeed: 0.3.10
- System:
    - OS: Linux
    - architecture:
        - 64bit
        - ELF
    - processor: x86_64
    - python: 3.8.5
    - version: #136-Ubuntu SMP Tue Jan 12 14:58:42 UTC 2021
- hparams
- deepspeed config
{ "fp16": { "enabled": true }, // "accumulate_grad_steps": 64, // setting in pytorch_lightning/plugins/training_type/deepspeed.py forcely "train_micro_batch_size_per_gpu": 1, "gradient_clipping": 1.0, "zero_allow_untested_optimizer": true, "zero_optimization": { "stage": 2, "allgather_partitions": true, "allgather_bucket_size": 2e8, "reduce_scatter": true, "reduce_bucket_size": 2e8, "overlap_comm": true, "contiguous_gradients": true, "cpu_offload": true } }
- pytorch_lightning trainer config
```python
args_dict = dict(
    data_dir="",
    output_dir="",
    model_name_or_path='google/mt5-small',
    tokenizer_name_or_path='google/mt5-small',
    max_seq_length=100,
    learning_rate=3e-4,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=1000,
    train_batch_size=1,
    eval_batch_size=1,
    num_train_epochs=1,
    gradient_accumulation_steps=64,  # finally set to 1
    n_gpu=1,
    fp_16=True,
    opt_level='O1',
    max_grad_norm=1.0,
)
```
@tchaton Thanks for the invitation. I am very glad to take part in the Lightning contributor team. 😃
Thanks @hwade 😃