Why does ZeRO-2 use more CUDA memory than ZeRO-1?
Following the bing_bert tutorial, my deepspeed_config is:
```json
{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 6e-3,
      "betas": [0.9, 0.99],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "grad_hooks": true,
    "round_robin_gradients": false
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 1e-8,
      "warmup_max_lr": 6e-3
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "sparse_attention": {
    "mode": "fixed",
    "block": 16,
    "different_layout_per_head": true,
    "num_local_blocks": 4,
    "num_global_blocks": 1,
    "attention": "bidirectional",
    "horizontal_global_attention": false,
    "num_different_global_patterns": 4
  }
}
```
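For reference, a config like this is consumed by the DeepSpeed engine at startup. A minimal sketch follows; the model and script wiring are illustrative placeholders, not the bing_bert code:

```python
import torch
import deepspeed

# Placeholder model standing in for BERT; for illustration only.
model = torch.nn.Linear(1024, 1024)

# deepspeed.initialize picks up the ZeRO stage, fp16, optimizer, and
# scheduler settings from the JSON config (recent DeepSpeed versions
# accept the file path directly via `config=`).
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="deepspeed_config.json",
)
```

With the usual launcher this would be started as `deepspeed train.py --deepspeed_config deepspeed_config.json` (the script name here is hypothetical).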
The CUDA memory usage for stage 1 is 8900 MB per GPU; for stage 2 it is 9600 MB per GPU.
ZeRO-2 is also much slower than ZeRO-1 in training speed.
Any help would be appreciated.
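(For anyone reproducing the numbers, a minimal sketch of reading the per-GPU figure from PyTorch; if the values above came from nvidia-smi instead, they also include the caching allocator's reserved-but-unused blocks:)

```python
import torch

# Peak CUDA memory on this rank, queried after some training steps.
# max_memory_allocated() counts live tensors only; max_memory_reserved()
# is closer to what nvidia-smi reports for the process.
peak_mb = torch.cuda.max_memory_allocated() / 2**20
reserved_mb = torch.cuda.max_memory_reserved() / 2**20
print(f"peak allocated: {peak_mb:.0f} MB, peak reserved: {reserved_mb:.0f} MB")
```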

@dancingpipi, thanks for the questions.
ZeRO is designed for very large models (> 1B parameters) that would not otherwise fit in available GPU memory. Similarly, the higher stages of ZeRO are meant for models that are too large for the lower stages. In summary, ZeRO's memory savings come at the cost of extra communication time and the (configurable) memory overhead of communication buffers.
As for your specific questions, please see #467 for a discussion on tuning ZeRO memory consumption.
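The bucket sizes in the config above account for much of the memory gap: ZeRO-2 allocates communication buffers sized by `allgather_bucket_size` and `reduce_bucket_size`, and at 5e8 elements in fp16 each buffer is close to 1 GiB. A back-of-envelope sketch, assuming (per the DeepSpeed docs) that bucket sizes are element counts and fp16 elements are 2 bytes:

```python
# Rough size of ZeRO-2's communication buffers for the config above.
# Assumption: bucket sizes are element counts (per DeepSpeed docs) and
# fp16 gradients take 2 bytes per element.
BYTES_PER_FP16 = 2

buckets = {
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8,
}
for name, elems in buckets.items():
    print(f"{name}: ~{elems * BYTES_PER_FP16 / 2**30:.2f} GiB")

# Shrinking both buckets (e.g. to 5e7) cuts this overhead roughly 10x,
# at the cost of more, smaller communication calls.
```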
Update: experiment for bert-large on 4 x V100 (16 GB).
PS: backward = backward_inner + backward_allreduce

|        | backward_inner | backward_allreduce |
| ------ | -------------- | ------------------ |
| ZeRO-1 | 184.97         | 0.02               |
| ZeRO-2 | 183.62         | 718.28             |
| ZeRO-3 | 391.50         | 234.34             |
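Per-phase timings like these come from DeepSpeed's wall-clock breakdown timers (disabled in the config above, so presumably enabled for this run). A quick read of what the table implies, assuming the figures are per-step times in a common unit:

```python
# Totals per the PS note that backward = backward_inner + backward_allreduce
# (units as reported by DeepSpeed's wall-clock breakdown, presumably ms).
timings = {
    "ZeRO-1": (184.97, 0.02),
    "ZeRO-2": (183.62, 718.28),
    "ZeRO-3": (391.50, 234.34),
}
for stage, (inner, allreduce) in timings.items():
    print(f"{stage}: backward ~= {inner + allreduce:.2f}")

# ZeRO-2's backward is dominated by gradient communication
# (backward_allreduce), consistent with it being slower than ZeRO-1
# for a model of this size.
```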
My question: