Zero-Offload Doubles VRAM Usage
Hi,
I am testing out DeepSpeed on a single GTX 1070 GPU. Everything works fine until I try to enable “cpu_offload” in the config, which then doubles GPU memory usage from 4GB to 8GB. CPU memory usage increases significantly as expected, so it seems that the data is being copied successfully. If I also enable “overlap_comm,” as recommended here, my system runs out of memory.
Here is my config file:
{ "train_batch_size": 96, "gradient_accumulation_steps": 1, "optimizer": { "type": "Adam", "params": { "lr": 0.00015 } }, "fp16": { "enabled": true }, "amp": { "enabled": false }, "gradient_clipping": 1.0, "zero_optimization": { "stage": 2, "cpu_offload": true, "contiguous_gradients": true, "overlap_comm": false } }
I am on Linux Mint 19 using PyTorch 1.7, CUDA 10.2, and the latest version of DeepSpeed. Let me know if you need any more information. Thank you for your time.
Top GitHub Comments
@stas00 Thanks for the nice summary of my discussion and for catching my typo as well 😃. @szhengac, I hope you find this summary useful as well.
Hopefully, we can work together to get the maximum benefit of zero-offload for your models. My suggestion is that we start with a small configuration to nail down the expected memory usage by running with batch size = 1 on 1 GPU.
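For reference, a minimal sketch of such a starting config, assuming the same optimizer and fp16 settings as the config above; the values are only illustrative for the batch-size-1 debugging run, not a tuned recommendation:

```json
{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "optimizer": { "type": "Adam", "params": { "lr": 0.00015 } },
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": false
  }
}
```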
This is the summary I saved away from this discussion:
ZeRO features that decrease gpu memory usage
- CPU offload

ZeRO features that increase gpu memory usage
- zero_optimization.allgather_bucket_size and zero_optimization.reduce_bucket_size have the biggest impact on memory usage with zero_optimization.stage=2
  - both default to 500000000, i.e. a ~1GB buffer each (5e8 elements x 2 bytes in fp16)
- zero_optimization.overlap_comm=true trades increased GPU RAM usage for lower all-reduce latency
  - overlap_comm uses 4.5x the allgather_bucket_size and reduce_bucket_size, so with the defaults it needs a ~9GB footprint (5e8 x 2 bytes x 2 x 4.5); a 2.5x reduction in buffer size, for example, may be enough (see the config sketch below):
  - "allgather_bucket_size": 200000000, "reduce_bucket_size": 200000000
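For context, a sketch of how those reduced bucket sizes might be folded into the original config if overlap_comm is turned back on. The 200000000 values are simply the example numbers from the bullet above, not a tuned recommendation, and JSON does not allow comments, so the caveat lives here:

```json
{
  "train_batch_size": 96,
  "gradient_accumulation_steps": 1,
  "optimizer": { "type": "Adam", "params": { "lr": 0.00015 } },
  "fp16": { "enabled": true },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "allgather_bucket_size": 200000000,
    "reduce_bucket_size": 200000000
  }
}
```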