Why does ZeRO-2 use more CUDA memory than ZeRO-1?
Following the bing_bert tutorial, my deepspeed_config is:
```json
{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 6e-3,
      "betas": [0.9, 0.99],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "grad_hooks": true,
    "round_robin_gradients": false
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 1e-8,
      "warmup_max_lr": 6e-3
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "sparse_attention": {
    "mode": "fixed",
    "block": 16,
    "different_layout_per_head": true,
    "num_local_blocks": 4,
    "num_global_blocks": 1,
    "attention": "bidirectional",
    "horizontal_global_attention": false,
    "num_different_global_patterns": 4
  }
}
```
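For reference, a config like this is consumed by the DeepSpeed engine at startup. A minimal sketch follows; the model and script wiring are illustrative placeholders, not the bing_bert code:

```python
import torch
import deepspeed

# Placeholder model standing in for BERT; for illustration only.
model = torch.nn.Linear(1024, 1024)

# deepspeed.initialize picks up the ZeRO stage, fp16, optimizer, and
# scheduler settings from the JSON config (recent DeepSpeed versions
# accept the file path directly via `config=`).
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="deepspeed_config.json",
)
```

With the usual launcher this would be started as `deepspeed train.py --deepspeed_config deepspeed_config.json` (the script name here is hypothetical).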
The CUDA memory usage for stage 1 is 8900 MB per GPU; for stage 2 it is 9600 MB per GPU.
ZeRO-2 is also much slower than ZeRO-1 in training speed.
Any help would be appreciated.
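(For anyone reproducing the numbers, a minimal sketch of reading the per-GPU figure from PyTorch; if the values above came from nvidia-smi instead, they also include the caching allocator's reserved-but-unused blocks:)

```python
import torch

# Peak CUDA memory on this rank, queried after some training steps.
# max_memory_allocated() counts live tensors only; max_memory_reserved()
# is closer to what nvidia-smi reports for the process.
peak_mb = torch.cuda.max_memory_allocated() / 2**20
reserved_mb = torch.cuda.max_memory_reserved() / 2**20
print(f"peak allocated: {peak_mb:.0f} MB, peak reserved: {reserved_mb:.0f} MB")
```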

@dancingpipi, thanks for the questions.
ZeRO is designed for very large models (> 1B parameters) that would not otherwise fit in available GPU memory. Similarly, the higher stages of ZeRO are meant for models that are too large for the lower stages. In summary, ZeRO's memory savings come at the cost of extra communication time and the (configurable) memory overhead of communication buffers.
As for your specific questions, please see #467 for a discussion on tuning ZeRO memory consumption.
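The bucket sizes in the config above account for much of the memory gap: ZeRO-2 allocates communication buffers sized by `allgather_bucket_size` and `reduce_bucket_size`, and at 5e8 elements in fp16 each buffer is close to 1 GiB. A back-of-envelope sketch, assuming (per the DeepSpeed docs) that bucket sizes are element counts and fp16 elements are 2 bytes:

```python
# Rough size of ZeRO-2's communication buffers for the config above.
# Assumption: bucket sizes are element counts (per DeepSpeed docs) and
# fp16 gradients take 2 bytes per element.
BYTES_PER_FP16 = 2

buckets = {
    "allgather_bucket_size": 5e8,
    "reduce_bucket_size": 5e8,
}
for name, elems in buckets.items():
    print(f"{name}: ~{elems * BYTES_PER_FP16 / 2**30:.2f} GiB")

# Shrinking both buckets (e.g. to 5e7) cuts this overhead roughly 10x,
# at the cost of more, smaller communication calls.
```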
Update: experiment for bert-large on 4 x V100 (16 GB).
PS: backward = backward_inner + backward_allreduce

|        | backward_inner | backward_allreduce |
| ------ | -------------- | ------------------ |
| ZeRO-1 | 184.97         | 0.02               |
| ZeRO-2 | 183.62         | 718.28             |
| ZeRO-3 | 391.50         | 234.34             |
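Per-phase timings like these come from DeepSpeed's wall-clock breakdown timers (disabled in the config above, so presumably enabled for this run). A quick read of what the table implies, assuming the figures are per-step times in a common unit:

```python
# Totals per the PS note that backward = backward_inner + backward_allreduce
# (units as reported by DeepSpeed's wall-clock breakdown, presumably ms).
timings = {
    "ZeRO-1": (184.97, 0.02),
    "ZeRO-2": (183.62, 718.28),
    "ZeRO-3": (391.50, 234.34),
}
for stage, (inner, allreduce) in timings.items():
    print(f"{stage}: backward ~= {inner + allreduce:.2f}")

# ZeRO-2's backward is dominated by gradient communication
# (backward_allreduce), consistent with it being slower than ZeRO-1
# for a model of this size.
```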
My question: