CUDA memory overflow during predictions

See original GitHub issue

I am running a simple multitask GP on an NVIDIA P6000 GPU (24 GB of memory). My code is below.

GP Model:

import gpytorch

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=128
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(), num_tasks=128, rank=1
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)

Model Import:

import torch
import gpytorch
from gp_models import MultitaskGPModel

inputs, targets = fine_tune_data()  # inputs and targets have shape [7200, 128]
# The likelihood was not shown in the original post; a MultitaskGaussianLikelihood is assumed here.
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=128).cuda()
gp_model = MultitaskGPModel(inputs, targets, likelihood).cuda()
# gp_model = torch.nn.DataParallel(gp_model)
gp_model.set_train_data(inputs, targets, strict=False)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, gp_model)
# Adam is assumed here; the original snippet used an unspecified `opt.optimizer`.
gp_optimizer = torch.optim.Adam([
    {'params': gp_model.parameters()},  # includes GaussianLikelihood parameters
], lr=0.02)

Training:

def fine_tune_train():
    gp_model.train()
    likelihood.train()
    n_iter = 50
    for i in range(n_iter):
        gp_optimizer.zero_grad()

        output = gp_model(inputs)
        loss = -mll(output, targets)
        loss.backward(retain_graph=True)
        print('Iter %d/%d - Loss: %.3f' % (i + 1, n_iter, loss.item()))
        gp_optimizer.step()

However, when a prediction is made with gp_model

prediction = likelihood(gp_model(test))  # test has shape [80, 128]

the GPU runs out of memory. Can someone help me identify the memory leak?
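For reference, a minimal sketch of the usual GPyTorch prediction pattern, assuming the gp_model, likelihood, and test variables from the snippets above: it switches to eval mode, disables autograd, and enables fast predictive variances, which reduces (though, as the answer below explains, does not eliminate) the memory needed at prediction time.

gp_model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    prediction = likelihood(gp_model(test))
    mean = prediction.mean                        # shape [80, 128]
    lower, upper = prediction.confidence_region()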

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
Balandat commented, Apr 16, 2019

You have 128 tasks. With the 7200 training points and the 80 test points, your full joint covariance matrix is n x n with n = 128 * (7200 + 80) ≈ 1M. That is the source of your "memory leak".
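A back-of-envelope check of that figure (the 4-byte float32 element size is an assumption, and GPyTorch often avoids materializing this matrix densely, but the scale is the point):

num_tasks, n_train, n_test = 128, 7200, 80
n = num_tasks * (n_train + n_test)        # 931,840 rows and columns
dense_bytes = n * n * 4                   # dense float32 storage
print(f"n = {n:,}; dense covariance ~ {dense_bytes / 1e12:.1f} TB")  # ~3.5 TB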

0 reactions
wanghuzhen commented, Apr 14, 2022

I have the same question. I run a simple multitask GP on an NVIDIA RTX 3090 Ti GPU (24 GB of memory). The input data has shape (3500, 42) and the output data has shape (3500, 3). I train the model on this data, then make predictions with the trained model, and memory is consumed quickly until OOM.

[screenshot] Memory usage during training.

[screenshot] Memory usage during prediction. The memory then stays at 13.2 GB. I tried torch.cuda.empty_cache() to release the memory, but it didn't work; it only releases about half of the memory.
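For context, a minimal sketch of how cached GPU memory is usually released and inspected in PyTorch (the tensor names refer to the prediction code further down; empty_cache() only frees cached blocks, not memory still referenced by live tensors):

import torch

del predictions, mean, lower, upper      # drop references to the prediction outputs first
torch.cuda.empty_cache()                 # return cached blocks to the driver
print(torch.cuda.memory_allocated() / 1e9, "GB still held by live tensors")
print(torch.cuda.memory_reserved() / 1e9, "GB reserved by the caching allocator")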

My code follows.

GP model:

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.num_tasks = train_y.shape[-1]
        self.input_features = train_x.shape[-1]
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=self.num_tasks
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(), num_tasks=self.num_tasks, rank=1
        )
        self.linear1 = torch.nn.Linear(self.input_features, self.input_features)

    def forward(self, x):
        x = self.linear1(x)
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)

Training:

likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(
    num_tasks=3).to(device)
model = MultitaskGPModel(train_x, train_y, likelihood).to(device)
model.double()
likelihood.train()
model.train()
# Use the adam optimizer
# Includes GaussianLikelihood parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(
    likelihood, model)
for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' %
            (i + 1, training_iterations, loss.item()))
    optimizer.step()

Prediction:

model.eval()
likelihood.eval()
# Make predictions
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()

Versions: gpytorch 1.6.0, python 3.9.7, pytorch 1.10.0.
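One possible workaround (not from this thread, just a sketch under the assumption that the joint test covariance is what blows up): predict in chunks, so each forward pass only builds the covariance for a slice of test_x. The chunk size of 256 is arbitrary and tunable.

model.eval()
likelihood.eval()
means, lowers, uppers = [], [], []
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    for chunk in test_x.split(256):              # predict 256 test points at a time
        pred = likelihood(model(chunk))
        means.append(pred.mean)
        lo, up = pred.confidence_region()
        lowers.append(lo)
        uppers.append(up)
mean = torch.cat(means)
lower, upper = torch.cat(lowers), torch.cat(uppers)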

