Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Memory leak during GCN training

See original GitHub issue

❓ Questions & Help

I’m currently training a 2-layer GCN model. I noticed that memory usage increases heavily with every epoch (already within the first epoch); during my training runs, consumption climbed to more than 40 GB, and running the model in Google Colab even triggers a tcmalloc: large alloc error. The issue occurs with both GPU and CPU training. The dataset has about 7000 lines of text (the R8 Reuters dataset). I marked the exact line where memory starts to climb in the code snippet below. Is there any way to optimize memory usage so the model can actually run?

def train(train_data, val_data, saver):
    train_data.init_node_feats(FLAGS.device)
    val_data.init_node_feats(FLAGS.device)
    model = create_model(train_data)
    model = model.to(FLAGS.device)
    pytorch_total_params = sum(p.numel() for p in model.parameters())
    print("Number params: ", pytorch_total_params)
    moving_avg = MovingAverage(FLAGS.validation_window_size, FLAGS.validation_metric != 'loss')
    pyg_graph = train_data.get_pyg_graph(FLAGS.device)
    optimizer = torch.optim.Adam(model.parameters(), lr=FLAGS.lr)

    epoch_losses = []
    losses = []
    for epoch in range(FLAGS.num_epochs):
        t = time.time()
        model.train()
        model.zero_grad()
        # Memory starts to increase here until it crashes
        loss, preds_train = model(pyg_graph, train_data)
        loss.backward()
        optimizer.step()

        loss = loss.item()
        with torch.no_grad():
            val_loss, preds_val = model(pyg_graph, val_data)
            val_loss = val_loss.item()
            eval_res_val = eval(preds_val, val_data, False)
            epoch_losses.append(val_loss)
            losses.append(loss)

    best_model = saver.load_trained_model(train_data)
    return best_model, model
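
A common first check when memory grows across PyTorch training iterations is whether any reference to the computation graph outlives the step; the loop above already stores loss.item() rather than the loss tensor, so the next step is usually to instrument the loop and see exactly where the growth happens. Below is a minimal measurement sketch; the log_memory helper and the psutil dependency are illustrative additions, not part of the original code:

import gc
import torch
import psutil

def log_memory(tag):
    # Print resident set size (CPU) and, if a GPU is present, allocated CUDA memory.
    rss_gb = psutil.Process().memory_info().rss / 1e9
    line = f"{tag}: RSS={rss_gb:.2f} GB"
    if torch.cuda.is_available():
        line += f", CUDA allocated={torch.cuda.memory_allocated() / 1e9:.2f} GB"
    print(line)

# Usage inside the epoch loop, around the marked line:
# log_memory(f"epoch {epoch} before forward")
# loss, preds_train = model(pyg_graph, train_data)
# log_memory(f"epoch {epoch} after forward")
# ...
# del preds_train, preds_val   # drop large intermediate outputs before the next epoch
# gc.collect()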

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
rusty1s commented, May 20, 2021

Yes, you still need to use the edge list for model.recon_loss, but this does not result in a higher memory footprint, as you are required to compute edge representations anyway:

z = model.encode(x, adj_t)
col, row, _ = adj_t.coo()
loss = model.recon_loss(z, torch.stack([row, col], dim=0))

You can further reduce the memory footprint by reconstructing only a subset of the positive edges, for example:

z = model.encode(x, adj_t)
col, row, _ = adj_t.coo()
perm = torch.randperm(col.size(0))[:1000]  # Only reconstruct 1000 random samples
loss = model.recon_loss(z, torch.stack([row[perm], col[perm]], dim=0))
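
For context, adj_t here is a torch_sparse SparseTensor holding the transposed adjacency matrix; its coo() method returns (row, col, value), so unpacking the first two values as (col, row) recovers the edge orientation of the untransposed graph. Below is a minimal, self-contained sketch of the subsampling idea; it assumes PyTorch Geometric’s GAE, and the two-layer GCN encoder and random placeholder data are illustrative stand-ins, not part of the original thread:

import torch
from torch_geometric.nn import GAE, GCNConv
from torch_sparse import SparseTensor

class GCNEncoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 2 * out_channels)
        self.conv2 = GCNConv(2 * out_channels, out_channels)

    def forward(self, x, adj_t):
        return self.conv2(self.conv1(x, adj_t).relu(), adj_t)

# Placeholder data standing in for a real graph.
num_nodes, num_features = 1000, 32
x = torch.randn(num_nodes, num_features)
edge_index = torch.randint(0, num_nodes, (2, 5000))
adj_t = SparseTensor.from_edge_index(edge_index, sparse_sizes=(num_nodes, num_nodes)).t()

model = GAE(GCNEncoder(num_features, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
optimizer.zero_grad()
z = model.encode(x, adj_t)
col, row, _ = adj_t.coo()
perm = torch.randperm(col.size(0))[:1000]  # reconstruct only 1000 random positive edges
loss = model.recon_loss(z, torch.stack([row[perm], col[perm]], dim=0))
loss.backward()
optimizer.step()

Sampling a fixed number of positive edges caps the size of the reconstruction term, so its memory cost stays roughly constant no matter how dense the graph is.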
0 reactions
anuradhawick commented, May 20, 2021

Thanks heaps for the suggestion! Cheers!


Top Results From Across the Web

  • Memory leak during GCN training · Issue #1702 - GitHub
    I’m currently training a 2-layer GCN model. I noticed that the memory needed for creating the model in each epoch is increasing heavily...
  • memory leak in keras while training a GAN - Stack Overflow
    There is a known issue where a memory leak appears in TF 2.x Keras when calling the network repeatedly in a loop.
  • Dealing with memory leak issue in Keras model training
    Recently, I was trying to train my Keras (v2.4.3) model with a tensorflow-gpu (v2.2.0) backend on NVIDIA’s Tesla V100-DGXS-32GB.
  • MVD: Memory-Related Vulnerability Detection Based on Flow ...
    In this paper, we propose MVD, a statement-level memory-related vulnerability detection approach based on flow-sensitive graph neural networks (FS-GNN)...
  • Deep Graph Library
    Time out when launching distributed training... Can the default implementation of GCN in the DGL tutorial handle heterogeneous... Memory leak in sparse.py?...
