[BUG] Offload memory usage not performing as expected
I'm working with graph neural networks for particle physics, where the graphs are often so large that not all gradients fit on the GPU at once. I'm hoping to use Stage 3 offloading to move parameters/gradients off the GPU and train on these graphs. After also trying this with FairScale, I'm settling for a simpler toy model to understand the memory behaviour. Here is the toy:
To Reproduce
import torch
from torch.utils.data import DataLoader

num_graphs = 10
num_input = 3
num_edges = 1
num_outputs = 1
num_hidden = 1024
num_layers = 100

edges = torch.rand(num_graphs, num_edges, num_input)
truth = torch.round(torch.rand(num_graphs, num_edges))
graph_data = torch.utils.data.TensorDataset(edges, truth)
dataloader = DataLoader(graph_data, batch_size=1)

net = torch.nn.Sequential(
    torch.nn.Linear(num_input, num_hidden),
    *[torch.nn.Linear(num_hidden, num_hidden) for _ in range(num_layers)],
    torch.nn.Linear(num_hidden, num_outputs),
)
Obviously this is not a GNN in any sense - it is just a sequential binary classifier. I train it with:
criterion = torch.nn.BCEWithLogitsLoss()

for step, (batch_edges, batch_truth) in enumerate(dataloader):
    torch.cuda.reset_peak_memory_stats()

    # forward pass
    batch_edges = batch_edges.to("cuda").squeeze(0)
    batch_truth = batch_truth.to("cuda").squeeze(0)
    output = model_engine(batch_edges)
    loss = criterion(output.squeeze(1), target=batch_truth)

    # backward pass
    model_engine.backward(loss)

    # weight update
    model_engine.step()

    print(f"Using memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
Expected behavior
I would expect this large model to consume a lot of memory with no offloading, but to be sharded down to a much smaller footprint with Stage 3 offloading. In fact, the opposite happens: with no offloading the model requires around 1.9 GB, and with offloading it requires 4.1 GB. Setting aside other memory-saving techniques (mixed precision, activation checkpointing), I'm hoping to understand why offloading by itself is not delivering a smaller memory footprint.
Am I missing something obvious here? Do I have the wrong idea of how ZeRO-3 offload is meant to work? If I can't get this toy to use less memory, I don't see how the more complicated GNN architecture would benefit.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/global/homes/d/danieltm/.conda/envs/exatrkx-test/lib/python3.7/site-packages/torch']
torch version .................... 1.9.1+cu102
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/global/homes/d/danieltm/.conda/envs/exatrkx-test/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.5.4, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 10.2
System info: 1x NVIDIA V100 GPU
FYI: Config file
{
  "train_batch_size": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "cpu"
    }
  }
}
Top GitHub Comments
@murnanedaniel, thanks for the report.
Can you please add DeepSpeed's memory usage profiler to your model and share the resulting log? I am particularly interested in the memory usage before forward, backward, and step. Thanks.
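A minimal sketch of what that instrumentation could look like, assuming the see_memory_usage helper from deepspeed.runtime.utils is the profiler being referred to:

from deepspeed.runtime.utils import see_memory_usage

for step, (batch_edges, batch_truth) in enumerate(dataloader):
    batch_edges = batch_edges.to("cuda").squeeze(0)
    batch_truth = batch_truth.to("cuda").squeeze(0)

    # Log CPU/GPU memory right before each phase of the training step.
    see_memory_usage("before forward", force=True)
    output = model_engine(batch_edges)
    loss = criterion(output.squeeze(1), target=batch_truth)

    see_memory_usage("before backward", force=True)
    model_engine.backward(loss)

    see_memory_usage("before step", force=True)
    model_engine.step()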
Does this mean it's a PyTorch issue, i.e. that it is not letting go of the inactive memory? Is there a way to keep PyTorch from reserving so much memory, or to free some of it? Simply `del`-ing some intermediate variables didn't seem to help much 😦
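For reference, a sketch of how the allocated vs. reserved numbers can be inspected with standard PyTorch calls; note that empty_cache only returns unused cached blocks to the driver and does not reduce the memory held by live tensors:

# Allocated = memory currently held by live tensors.
# Reserved  = allocated plus blocks kept around by PyTorch's caching allocator.
print(f"allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

# Releases unused cached blocks back to the GPU driver. Peak usage during
# forward/backward is unaffected; this mainly lowers the reserved number.
torch.cuda.empty_cache()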