
Are ZeRO CPU offload and gradient accumulation compatible?


I’m trying out @stas00’s HuggingFace DeepSpeed integration and it’s super cool!

But I’m running into an error when I try to enable both cpu offload and gradient accumulation at the same time, and I’m not sure if my problem is on the HuggingFace side, or the DeepSpeed side, or (most likely) between my chair and keyboard. Since this post is in the DeepSpeed project, I’ll leave out the HuggingFace specifics for now.

My training script runs just fine with either cpu_offload=true or --gradient_accumulation_steps > 1 on its own, but if I try to use both together, it throws the following:

  File "bin/train.py", line 306, in <module>
    main()
  File "bin/train.py", line 265, in main
    train_result = trainer.train()
  File "/opt/miniconda3/envs/hf/lib/python3.8/site-packages/transformers/trainer.py", line 921, in train
    self.optimizer.step()
  File "/opt/miniconda3/envs/hf/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1378, in step
    self.complete_grad_norm_calculation_for_cpu_offload(
  File "/opt/miniconda3/envs/hf/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 881, in complete_grad_norm_calculation_for_cpu_offload
    param_norm = self.norm_for_param_grads[param_id]
KeyError: 0
Traceback (most recent call last):
  File "bin/train.py", line 306, in <module>
    main()
  File "bin/train.py", line 265, in main
    train_result = trainer.train()
  File "/opt/miniconda3/envs/hf/lib/python3.8/site-packages/transformers/trainer.py", line 921, in train
    self.optimizer.step()
  File "/opt/miniconda3/envs/hf/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 1378, in step
    self.complete_grad_norm_calculation_for_cpu_offload(
  File "/opt/miniconda3/envs/hf/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 881, in complete_grad_norm_calculation_for_cpu_offload
    param_norm = self.norm_for_param_grads[param_id]
KeyError: 130
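
Reading the traceback, it looks like the ZeRO stage-2 optimizer records per-parameter gradient norms in a dict keyed by a param id (norm_for_param_grads) and reads them all back inside step(), so the KeyError presumably means no norm was ever recorded for that parameter before the step. A simplified illustration of the failing pattern (this is not DeepSpeed’s actual code, just the shape of the error):

# Simplified sketch of the failing pattern (not DeepSpeed's actual code):
# norms are recorded per parameter as gradients are reduced/offloaded, and
# step() later reads every norm back by param id.
norm_for_param_grads = {}

def record_grad_norm(param_id, grad_norm):
    # called while a parameter's gradient is offloaded to CPU
    norm_for_param_grads[param_id] = grad_norm

def complete_grad_norm_calculation(param_ids):
    # called from optimizer.step(); assumes every param id was recorded
    total = 0.0
    for param_id in param_ids:
        total += norm_for_param_grads[param_id] ** 2
    return total ** 0.5

record_grad_norm(1, 0.5)                  # param 0 never gets recorded
complete_grad_norm_calculation([0, 1])    # -> KeyError: 0, like above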

I’m assuming it’s because I haven’t configured DeepSpeed or my optimizer correctly. But before I dig too much deeper, I wanted to make sure that using both was supported. I haven’t seen anything in the documentation that would indicate that it wasn’t.

@stas00 have you tried both simultaneously in your HuggingFace integration testing?

This is my DeepSpeed config JSON:

{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e8,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "cpu_offload": true
  },

  "optimizer": {
    "type": "Adam",
    "params": {
      "adam_w_mode": true,
      "lr": 3e-5,
      "betas": [ 0.9, 0.999 ],
      "eps": 1e-8,
      "weight_decay": 0.1
    }
  },

  "scheduler": {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": 0,
        "warmup_max_lr": 3e-5,
        "warmup_num_steps": 500
    }
  }
}
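
And for completeness, here’s roughly how the two settings come together on the HuggingFace side. This is only a minimal sketch, not my actual bin/train.py: the model, dummy dataset, batch size, and config filename are placeholders, and it assumes the script is launched with the deepspeed launcher rather than plain python.

# Minimal sketch (placeholder model/dataset) of combining the DeepSpeed config
# above with gradient accumulation through the HF Trainer.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"                       # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

class DummyDataset(Dataset):
    """Tiny stand-in dataset so the sketch is self-contained."""
    def __init__(self, tokenizer, n=64):
        self.encodings = [tokenizer("hello world", padding="max_length",
                                    truncation=True, max_length=16)
                          for _ in range(n)]
    def __len__(self):
        return len(self.encodings)
    def __getitem__(self, i):
        item = {k: torch.tensor(v) for k, v in self.encodings[i].items()}
        item["labels"] = torch.tensor(0)
        return item

training_args = TrainingArguments(
    output_dir="output",
    fp16=True,                          # matches "fp16": {"enabled": true}
    per_device_train_batch_size=8,      # placeholder
    gradient_accumulation_steps=4,      # > 1 is what triggers the KeyError
    deepspeed="ds_config.json",         # the JSON config shown above
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=DummyDataset(tokenizer))
trainer.train()                         # fails in optimizer.step() as in the traceback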

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, Jan 15, 2021

@stas00 have you tried both simultaneously in your HuggingFace integration testing?

I hadn’t tried that combination yet, and when I do, I get the same error as you.

Let me investigate to ensure it’s not something missing on our side.

0 reactions
jncasey commented, Jan 15, 2021

Awesome!
