Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Memory leak during GCN training

See original GitHub issue

❓ Questions & Help

I’m currently training a 2-layer GCN model. I noticed that memory usage increases heavily with every epoch (already within the first epoch); during my training runs, consumption climbed to more than 40 GB, and running the model in Google Colab even triggers a tcmalloc: large alloc error. The issue occurs with both GPU and CPU training. The dataset has about 7000 lines of text (the R8 Reuters dataset). I marked the exact line where memory starts to climb in the code snippet below. Is there any way to optimize memory usage so the model can actually run?

def train(train_data, val_data, saver):
    train_data.init_node_feats(FLAGS.device)
    val_data.init_node_feats(FLAGS.device)
    model = create_model(train_data)
    model = model.to(FLAGS.device)
    pytorch_total_params = sum(p.numel() for p in model.parameters())
    print("Number params: ", pytorch_total_params)
    moving_avg = MovingAverage(FLAGS.validation_window_size, FLAGS.validation_metric != 'loss')
    pyg_graph = train_data.get_pyg_graph(FLAGS.device)
    optimizer = torch.optim.Adam(model.parameters(), lr=FLAGS.lr)

    epoch_losses = []
    losses = []
    for epoch in range(FLAGS.num_epochs):
        t = time.time()
        model.train()
        model.zero_grad()
        # Memory starts to increase here until it crashes
        loss, preds_train = model(pyg_graph, train_data)
        loss.backward()
        optimizer.step()

        loss = loss.item()
        with torch.no_grad():
            val_loss, preds_val = model(pyg_graph, val_data)
            val_loss = val_loss.item()
            eval_res_val = eval(preds_val, val_data, False)
            epoch_losses.append(val_loss)
            losses.append(loss)

    best_model = saver.load_trained_model(train_data)
    return best_model, model
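
A common first check when memory grows across PyTorch training iterations is whether any reference to the computation graph outlives the step; the loop above already stores loss.item() rather than the loss tensor, so the next step is usually to instrument the loop and see exactly where the growth happens. Below is a minimal measurement sketch; the log_memory helper and the psutil dependency are illustrative additions, not part of the original code:

import gc
import torch
import psutil

def log_memory(tag):
    # Print resident set size (CPU) and, if a GPU is present, allocated CUDA memory.
    rss_gb = psutil.Process().memory_info().rss / 1e9
    line = f"{tag}: RSS={rss_gb:.2f} GB"
    if torch.cuda.is_available():
        line += f", CUDA allocated={torch.cuda.memory_allocated() / 1e9:.2f} GB"
    print(line)

# Usage inside the epoch loop, around the marked line:
# log_memory(f"epoch {epoch} before forward")
# loss, preds_train = model(pyg_graph, train_data)
# log_memory(f"epoch {epoch} after forward")
# ...
# del preds_train, preds_val   # drop large intermediate outputs before the next epoch
# gc.collect()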

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
rusty1s commented, May 20, 2021

Yes, you still need to use the edge list for model.recon_loss, but this does not result in a higher memory footprint, as you are required to compute edge representations anyway:

z = model.encode(x, adj_t)
col, row, _ = adj_t.coo()
loss = model.recon_loss(z, torch.stack([row, col], dim=0))

You can further reduce the memory footprint by reconstructing only a subset of the positive edges, for example:

z = model.encode(x, adj_t)
col, row, _ = adj_t.coo()
perm = torch.randperm(col.size(0))[:1000]  # Only reconstruct 1000 random samples
loss = model.recon_loss(z, torch.stack([row[perm], col[perm]], dim=0))
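
For context, adj_t here is a torch_sparse SparseTensor holding the transposed adjacency matrix; its coo() method returns (row, col, value), so unpacking the first two values as (col, row) recovers the edge orientation of the untransposed graph. Below is a minimal, self-contained sketch of the subsampling idea; it assumes PyTorch Geometric’s GAE, and the two-layer GCN encoder and random placeholder data are illustrative stand-ins, not part of the original thread:

import torch
from torch_geometric.nn import GAE, GCNConv
from torch_sparse import SparseTensor

class GCNEncoder(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, 2 * out_channels)
        self.conv2 = GCNConv(2 * out_channels, out_channels)

    def forward(self, x, adj_t):
        return self.conv2(self.conv1(x, adj_t).relu(), adj_t)

# Placeholder data standing in for a real graph.
num_nodes, num_features = 1000, 32
x = torch.randn(num_nodes, num_features)
edge_index = torch.randint(0, num_nodes, (2, 5000))
adj_t = SparseTensor.from_edge_index(edge_index, sparse_sizes=(num_nodes, num_nodes)).t()

model = GAE(GCNEncoder(num_features, 16))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
optimizer.zero_grad()
z = model.encode(x, adj_t)
col, row, _ = adj_t.coo()
perm = torch.randperm(col.size(0))[:1000]  # reconstruct only 1000 random positive edges
loss = model.recon_loss(z, torch.stack([row[perm], col[perm]], dim=0))
loss.backward()
optimizer.step()

Sampling a fixed number of positive edges caps the size of the reconstruction term, so its memory cost stays roughly constant no matter how dense the graph is.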
0 reactions
anuradhawick commented, May 20, 2021

Thanks heaps for the suggestion! Cheers!


Top Results From Across the Web

  • Memory leak during GCN training · Issue #1702 - GitHub
    I’m currently training a 2-layer GCN model. I noticed that the memory needed for creating the model in each epoch is increasing heavily...
  • memory leak in keras while training a GAN - Stack Overflow
    There is a known issue where a memory leak appears in TF 2.x Keras when calling the network repeatedly in a loop.
  • Dealing with memory leak issue in Keras model training
    Recently, I was trying to train my Keras (v2.4.3) model with a tensorflow-gpu (v2.2.0) backend on NVIDIA’s Tesla V100-DGXS-32GB.
  • MVD: Memory-Related Vulnerability Detection Based on Flow ...
    In this paper, we propose MVD, a statement-level memory-related vulnerability detection approach based on flow-sensitive graph neural networks (FS-GNN)...
  • Deep Graph Library
    Time out when launching distributed training... Can the default implementation of GCN in the DGL tutorial handle heterogeneous... Memory leak in sparse.py?...
