Zero-Offload Doubles VRAM Usage
Hi,
I am testing out DeepSpeed on a single GTX 1070 GPU. Everything works fine until I try to enable “cpu_offload” in the config, which then doubles GPU memory usage from 4GB to 8GB. CPU memory usage increases significantly as expected, so it seems that the data is being copied successfully. If I also enable “overlap_comm,” as recommended here, my system runs out of memory.
Here is my config file:
{ "train_batch_size": 96, "gradient_accumulation_steps": 1, "optimizer": { "type": "Adam", "params": { "lr": 0.00015 } }, "fp16": { "enabled": true }, "amp": { "enabled": false }, "gradient_clipping": 1.0, "zero_optimization": { "stage": 2, "cpu_offload": true, "contiguous_gradients": true, "overlap_comm": false } }
I am on Linux Mint 19 using PyTorch 1.7, CUDA 10.2, and the latest version of DeepSpeed. Let me know if you need any more information. Thank you for your time.
Top GitHub Comments
@stas00 Thanks for the nice summary of my discussion and for catching my typo as well 😃. @szhengac, I hope you find this summary useful as well.
Hopefully, we can work together to get the maximum benefit of zero-offload for your models. My suggestion is that we start with a small configuration to nail down the expected memory usage by running with batch size = 1 on 1 GPU.
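For reference, a minimal sketch of such a starting config, assuming the same optimizer and fp16 settings as the config above; the values are only illustrative for the batch-size-1 debugging run, not a tuned recommendation:

```json
{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "optimizer": { "type": "Adam", "params": { "lr": 0.00015 } },
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": false
  }
}
```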
This is the summary I saved away from this discussion:
ZeRO features that decrease gpu memory usage
- CPU offload

ZeRO features that increase gpu memory usage
- zero_optimization.allgather_bucket_size and zero_optimization.reduce_bucket_size have the biggest impact on memory usage with zero_optimization.stage=2
  - both default to 500000000, i.e. a ~1GB buffer each (5e8 elements x 2 bytes in fp16)
- zero_optimization.overlap_comm=true trades increased GPU RAM usage for lower all-reduce latency
  - overlap_comm uses 4.5x the allgather_bucket_size and reduce_bucket_size, so with the defaults it needs a ~9GB footprint (5e8 x 2 bytes x 2 x 4.5); a 2.5x reduction in buffer size, for example, may be enough (see the config sketch below):
  - "allgather_bucket_size": 200000000, "reduce_bucket_size": 200000000
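For context, a sketch of how those reduced bucket sizes might be folded into the original config if overlap_comm is turned back on. The 200000000 values are simply the example numbers from the bullet above, not a tuned recommendation, and JSON does not allow comments, so the caveat lives here:

```json
{
  "train_batch_size": 96,
  "gradient_accumulation_steps": 1,
  "optimizer": { "type": "Adam", "params": { "lr": 0.00015 } },
  "fp16": { "enabled": true },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "allgather_bucket_size": 200000000,
    "reduce_bucket_size": 200000000
  }
}
```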