
Why does ZeRO-2 use more CUDA memory than ZeRO-1?

See original GitHub issue

Following the bing_bert tutorial, my deepspeed_config is:

```json
{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 6e-3,
      "betas": [
        0.9,
        0.99
      ],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },

  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "grad_hooks": true,
    "round_robin_gradients": false
  },


  "scheduler": {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": 1e-8,
        "warmup_max_lr": 6e-3
    }
  },
  "gradient_clipping": 1.0,

  "wall_clock_breakdown": false,

  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "sparse_attention": {
    "mode": "fixed",
    "block": 16,
    "different_layout_per_head": true,
    "num_local_blocks": 4,
    "num_global_blocks": 1,
    "attention": "bidirectional",
    "horizontal_global_attention": false,
    "num_different_global_patterns": 4
  }
}
```
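For reference, a config like this is typically passed to `deepspeed.initialize`. A minimal, hypothetical sketch (the stand-in model and values below are illustrative, not from this issue; presumably only `"stage"` changes between the ZeRO-1 and ZeRO-2 runs being compared):

```python
# Minimal sketch; run under the DeepSpeed launcher: deepspeed this_script.py
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # tiny stand-in for bing_bert's BERT-Large

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 6e-3}},
    "zero_optimization": {"stage": 1},  # flip to 2 to reproduce the comparison
}

# Wraps the model and optimizer according to the config (a dict or a JSON path).
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```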

The CUDA memory usage for stage 1 is 8900 MB per GPU; for stage 2 it is 9600 MB per GPU.

And ZeRO-2 is much slower than ZeRO-1 in training speed.

Any help will be appreciated~

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
tjruwase commented, Aug 23, 2021

@dancingpipi, thanks for the questions.

ZeRO is designed for very large models, > 1B parameters, that would not otherwise fit in available GPU memory. Similarly, the higher stages of ZeRO are meant for models that are too large for the lower stages. In summary, ZeRO's memory savings come at the cost of extra communication time and the (configurable) memory overhead of communication buffers. The answers to your specific questions are:

  1. All ZeRO stages have comparable memory usage because BERT-Large (~340M params) is smaller than 1B, the communication buffers are GBs by default, and the data-parallelism degree (4) is quite small. BERT-Large is not a model that needs ZeRO (see the back-of-the-envelope sketch below).
  2. The ZeRO-2 backward pass is slower because gradient partitioning occurs during the backward pass, and that requires all-reduce communication.
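For intuition on point 1, the memory model for model states from the ZeRO paper can be sketched in a few lines (a back-of-the-envelope estimate, assuming mixed-precision Adam with the paper's K = 12 constant; activations and buffers excluded):

```python
# Model-state memory per GPU, following the ZeRO paper: fp16 params (2P bytes)
# + fp16 grads (2P) + fp32 optimizer states (K*P, K = 12 for Adam: master
# weights, momentum, variance). Activations and communication buffers excluded.
P = 340e6   # BERT-Large, ~340M parameters
N = 4       # data-parallel degree in this issue
GB = 1024**3

zero0 = (2*P + 2*P + 12*P) / GB        # nothing partitioned
zero1 = (2*P + 2*P + 12*P/N) / GB      # optimizer states partitioned
zero2 = (2*P + (2*P + 12*P)/N) / GB    # ... plus gradients
zero3 = (2*P + 2*P + 12*P) / N / GB    # ... plus parameters

print(f"ZeRO-0 {zero0:.2f} GB, ZeRO-1 {zero1:.2f} GB, "
      f"ZeRO-2 {zero2:.2f} GB, ZeRO-3 {zero3:.2f} GB")
# ~5.07, ~2.22, ~1.74, ~1.27 GB: stage-to-stage savings of a few hundred MB,
# smaller than the ~1 GB default communication buffers (5e8 fp16 elements).
```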

Please see #467 for a discussion of tuning ZeRO memory consumption.
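One knob discussed there is shrinking the communication buffers. A hedged sketch (the values are illustrative, not a recommendation; the keys are the same ones used in the config above):

```python
# Illustrative only: smaller buckets lower peak memory at some cost in
# communication efficiency. 5e8 fp16 elements ~= 1 GB per buffer;
# 1e8 elements ~= 0.2 GB.
zero_tuning = {
    "zero_optimization": {
        "stage": 2,
        "allgather_bucket_size": 1e8,  # down from 5e8
        "reduce_bucket_size": 1e8,     # down from 5e8
        "contiguous_gradients": True,
        "overlap_comm": False,
    }
}
```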

1 reaction
dancingpipi commented, Aug 19, 2021

Update: experiment with BERT-Large on 4× V100 (16 GB)

| Batch size = 64 | NVIDIA-BERT | ZeRO-0 | ZeRO-1 | ZeRO-2 | ZeRO-3 |
| --- | --- | --- | --- | --- | --- |
| CUDA memory (MB) | OOM | 15853 | 13509 | 13499 | 14237 |
| Forward time (ms) | / | 98.19 | 98.3 | 96.88 | 317.15 |
| Backward time (ms) | / | 186.42 | 185.42 | 900.62 | 600.45 |
| Total time (ms) | / | 284.63 | 283.78 | 997.53 | 917.63 |
| Throughput (samples/s) | / | 899.41 | 902.12 | 256.63 | 278.98 |
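As a sanity check on units, the throughput row is consistent with the global batch (4 GPUs × 64) divided by the total iteration time (a quick verification, assuming that relationship):

```python
# Assumption: throughput = global batch / total iteration time.
global_batch = 4 * 64  # 4x V100, batch size 64 per GPU
totals_ms = {"ZeRO-0": 284.63, "ZeRO-1": 283.78,
             "ZeRO-2": 997.53, "ZeRO-3": 917.63}
for stage, ms in totals_ms.items():
    print(stage, round(global_batch / (ms / 1000), 2), "samples/s")
# -> ~899.4, ~902.1, ~256.6, ~279.0, matching the table to rounding
```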

PS: backward = backward_inner + backward_allreduce:

|        | backward_inner (ms) | backward_allreduce (ms) |
| --- | --- | --- |
| ZeRO-1 | 184.97 | 0.02 |
| ZeRO-2 | 183.62 | 718.28 |
| ZeRO-3 | 391.50 | 234.34 |

My questions:

  1. Why are ZeRO-2 and ZeRO-3 not superior to ZeRO-1 in memory usage?
  2. Why is the ZeRO-2 backward pass slower than ZeRO-3's? To my knowledge, ZeRO-2 does not require additional communication.