
[BUG] Offload memory usage not performing as expected

See original GitHub issue

I’m working with graph neural networks for particle physics, where we often have graphs so large that all of the gradients cannot fit on the GPU at once. I’m hoping to use Stage 3 offloading to move parameters/gradients off the GPU and train on these graphs. However, after also trying this with FairScale, I’ve settled on a simpler toy model to understand the memory behaviour. Here is the toy:

To Reproduce

import torch
from torch.utils.data import DataLoader

num_graphs = 10
num_input = 3
num_edges = 1
num_outputs = 1

num_hidden = 1024
num_layers = 100

# Random "edge" features and binary targets, wrapped in a standard DataLoader
edges = torch.rand(num_graphs, num_edges, num_input)
truth = torch.round(torch.rand(num_graphs, num_edges))
graph_data = torch.utils.data.TensorDataset(edges, truth)
dataloader = DataLoader(graph_data, batch_size=1)

# A deep stack of Linear layers (~105M parameters)
net = torch.nn.Sequential(
    torch.nn.Linear(num_input, num_hidden),
    *[torch.nn.Linear(num_hidden, num_hidden) for _ in range(num_layers)],
    torch.nn.Linear(num_hidden, num_outputs),
)

Obviously this is not a GNN in any sense; it is just a sequential model used as a binary classifier. I train it with:

criterion = torch.nn.BCEWithLogitsLoss()

for step, (batch_edges, batch_truth) in enumerate(dataloader):
    torch.cuda.reset_peak_memory_stats()

    # forward pass
    batch_edges, batch_truth = batch_edges.to("cuda").squeeze(0), batch_truth.to("cuda").squeeze(0)
    output = model_engine(batch_edges)
    loss = criterion(output.squeeze(1), target=batch_truth)

    # backward pass (handled by the DeepSpeed engine)
    model_engine.backward(loss)

    # weight update
    model_engine.step()
    # peak GPU memory allocated since reset_peak_memory_stats()
    print(f'Using memory: {torch.cuda.max_memory_allocated()/1024**3} GB')
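The snippet above omits how model_engine is created. As a minimal sketch (the config filename here is hypothetical), the standard DeepSpeed setup wrapping the net defined above would look roughly like this:

import deepspeed

# On older DeepSpeed versions the JSON config may instead need to be passed
# via an args object with args.deepspeed_config set to the file path.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=net.parameters(),
    config="ds_config.json",  # the ZeRO-3 offload config shown further down
)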

Expected behavior

I would expect this large model to consume a lot of memory with no offloading, but to be sharded down to a much smaller footprint with Stage 3 offloading. In fact, it’s the opposite: with no offloading the model requires around 1.9 GB, and with offloading it requires 4.1 GB. Setting aside other memory-saving techniques (mixed precision, activation checkpointing), I’m hoping to understand why offloading by itself is not delivering a smaller memory footprint.

Am I missing something obvious here? Do I have the wrong idea of how ZeRO-3 offload is meant to work? If I can’t get this toy to use less memory, I don’t see how the more complicated GNN architecture would benefit.
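For a rough sense of scale, a back-of-the-envelope sketch (assuming an Adam-style optimizer) of the fp32 model states this toy carries:

num_hidden, num_layers = 1024, 100

# hidden Linear layers + input (3 -> 1024) and output (1024 -> 1) layers
params = num_layers * (num_hidden * num_hidden + num_hidden)
params += 3 * num_hidden + num_hidden
params += num_hidden * 1 + 1

bytes_fp32 = 4
weights = params * bytes_fp32
grads = params * bytes_fp32
adam_states = 2 * params * bytes_fp32  # exp_avg + exp_avg_sq

total_gb = (weights + grads + adam_states) / 1024**3
print(f"{params/1e6:.0f}M params, ~{total_gb:.2f} GB of fp32 model states")
# -> 105M params, ~1.56 GB of fp32 model states

Weights, gradients and optimizer state alone come to roughly 1.56 GB, which (plus activations and allocator overhead) is broadly consistent with the 1.9 GB observed without offloading; the puzzle is why the offloaded run ends up higher rather than lower.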

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/global/homes/d/danieltm/.conda/envs/exatrkx-test/lib/python3.7/site-packages/torch']
torch version .................... 1.9.1+cu102
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/global/homes/d/danieltm/.conda/envs/exatrkx-test/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.5.4, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 10.2

System info: 1 GPU (V100)

FYI: Config file

{
    "train_batch_size": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    }
}

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
tjruwase commented, Oct 7, 2021

@murnanedaniel, thanks for the report.

Can you please add deepspeed’s memory usage profiler to your model?

  1. import like here
  2. use like here

And please share the resulting log. I am particularly interested in the memory usage before forward, backward, and step. Thanks.
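The profiler being referred to is presumably DeepSpeed’s see_memory_usage helper (from deepspeed.runtime.utils); a minimal sketch of wiring it into the toy loop might look like:

from deepspeed.runtime.utils import see_memory_usage

for step, (batch_edges, batch_truth) in enumerate(dataloader):
    see_memory_usage("before forward", force=True)
    output = model_engine(batch_edges.to("cuda").squeeze(0))
    loss = criterion(output.squeeze(1), target=batch_truth.to("cuda").squeeze(0))

    see_memory_usage("before backward", force=True)
    model_engine.backward(loss)

    see_memory_usage("before step", force=True)
    model_engine.step()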

0 reactions
taehyunzzz commented, Apr 21, 2022

> before       nothing    optimizer    optimizer+params
> forward      0.39       0.39         0.01
> backward     0.39       0.40         0.40
> optimizer    0.78       0.40         0.01
>
> Above is a summary from the memory usage logs. It shows the memory utilization (in GB) before the first forward, backward, and optimizer calls when offloading nothing, the optimizer, and optimizer+params. The MA metric captures the actual usage. As you can see, offloading reduces GPU memory usage. You might also notice that memory usage for “nothing” increases for later batches (up to 1.56 GB), whereas the usage remains stable with offloading.
>
> How did you obtain the 1.9 GB and 4.1 GB memory usage that you attributed to nothing and offloading? If it is through nvidia-smi, then note that nvidia-smi captures the total GPU memory cached by the PyTorch process and not the actively used memory. So nvidia-smi is not a good estimator of GPU memory use, especially for relatively small models like your example. Hope that helps.

Does this mean it’s a PyTorch issue, for not letting go of the inactive memory? Is there a way to keep PyTorch from reserving so much memory, or to free some of it? Simply del-ing some intermediate variables didn’t seem to help much 😦
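For what it’s worth, the gap being described is the difference between PyTorch’s allocated and reserved memory: torch.cuda.memory_allocated() counts live tensors (the MA metric above), while torch.cuda.memory_reserved() counts everything the caching allocator is holding on to, which is roughly what nvidia-smi reports. A minimal sketch for inspecting both, and for returning unused cached blocks to the driver with empty_cache() (this does not free live tensors), might be:

import torch

def report_cuda_memory(tag):
    allocated = torch.cuda.memory_allocated() / 1024**3   # live tensors
    reserved = torch.cuda.memory_reserved() / 1024**3     # caching allocator pool
    print(f"{tag}: allocated={allocated:.2f} GB, reserved={reserved:.2f} GB")

report_cuda_memory("before empty_cache")
torch.cuda.empty_cache()   # hands unused cached blocks back to the CUDA driver
report_cuda_memory("after empty_cache")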
