Memory leak during GCN training
❓ Questions & Help
I'm currently training a 2-layer GCN model. I noticed that the memory needed for the model grows heavily with every epoch (starting already in the first epoch). During my training runs, memory consumption went up to more than 40 GB. Running the model in Google Colab even leads to a tcmalloc: large alloc error. The memory issue occurs with both GPU and CPU training. The dataset has about 7000 lines of text (R8 Reuters dataset). I marked the exact line where the memory increases heavily in the code snippet below. Is there any way to optimize the memory usage so that I can actually run the model?
import time

import torch


def train(train_data, val_data, saver):
    train_data.init_node_feats(FLAGS.device)
    val_data.init_node_feats(FLAGS.device)
    model = create_model(train_data)
    model = model.to(FLAGS.device)
    pytorch_total_params = sum(p.numel() for p in model.parameters())
    print("Number params: ", pytorch_total_params)
    moving_avg = MovingAverage(FLAGS.validation_window_size, FLAGS.validation_metric != 'loss')
    pyg_graph = train_data.get_pyg_graph(FLAGS.device)
    optimizer = torch.optim.Adam(model.parameters(), lr=FLAGS.lr)
    epoch_losses = []
    losses = []
    for epoch in range(FLAGS.num_epochs):
        t = time.time()
        model.train()
        model.zero_grad()
        # Memory starts to increase here until it crashes
        loss, preds_train = model(pyg_graph, train_data)
        loss.backward()
        optimizer.step()
        loss = loss.item()
        # Validation pass without gradient tracking
        with torch.no_grad():
            val_loss, preds_val = model(pyg_graph, val_data)
            val_loss = val_loss.item()
            eval_res_val = eval(preds_val, val_data, False)
        epoch_losses.append(val_loss)
        losses.append(loss)
    best_model = saver.load_trained_model(train_data)
    return best_model, model
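To confirm that the growth really comes from the marked line and to quantify it per epoch, one option is to log the process's peak resident memory after every epoch. The following is only a minimal sketch using the Python standard library on Unix; peak_rss_mb and log_memory are hypothetical helper names, and the bytearray loop is a stand-in for the real training step. For GPU runs, torch.cuda.max_memory_allocated() could be logged in the same way.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(epoch, history):
    """Record the current peak RSS and print the growth since the last call."""
    current = peak_rss_mb()
    growth = current - history[-1] if history else 0.0
    history.append(current)
    print(f"epoch {epoch}: peak RSS {current:.1f} MiB (+{growth:.1f} MiB)")
    return current


history = []
buffers = []
for epoch in range(3):
    # Stand-in workload; in the real loop this is the forward/backward pass.
    buffers.append(bytearray(20 * 1024 * 1024))
    log_memory(epoch, history)
```

Calling log_memory(epoch, history) at the end of each iteration of the training loop above would show whether memory grows by a constant amount per epoch or jumps once during the first forward pass.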
Issue Analytics
- Created 3 years ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
Yes, you still need to use the edge list for model.recon_loss, but this does not result in a higher memory footprint, as you are required to compute edge representations anyway. You can reduce the memory footprint by only reconstructing a subset of positive edges though, for example via:
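The snippet that originally followed this comment is not preserved in this copy. As a minimal sketch of the idea, assuming the positive edges are stored as a [2, num_edges] tensor, a random subset of columns can be drawn with torch.randperm. The helper name sample_positive_edges and the toy edge list are illustrative and not taken from the original comment.

```python
import torch


def sample_positive_edges(edge_index, num_samples, generator=None):
    """Return a random subset of columns from a [2, num_edges] edge list."""
    num_edges = edge_index.size(1)
    num_samples = min(num_samples, num_edges)
    # Random permutation of edge positions, truncated to the requested size.
    perm = torch.randperm(num_edges, generator=generator)[:num_samples]
    return edge_index[:, perm.to(edge_index.device)]


# Toy edge list with 6 directed edges (illustrative data only).
edge_index = torch.tensor([[0, 1, 2, 3, 4, 5],
                           [1, 2, 3, 4, 5, 0]])
subset = sample_positive_edges(edge_index, num_samples=3)
print(subset.shape)  # torch.Size([2, 3])

# Hypothetical use inside the training loop, assuming `model` is a
# torch_geometric.nn.GAE and `z` holds the encoded node embeddings:
#   loss = model.recon_loss(z, sample_positive_edges(edge_index, 1024))
```

Drawing a fresh subset in every epoch means that all positive edges are still seen over the course of training, while each individual step only decodes num_samples edges.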
Thanks heaps for the suggestion! Cheers!