CUDA memory overflow during predictions

See original GitHub issue

I am running a simple multitask GP on an NVIDIA P6000 GPU (24 GB of memory). My code is below.

GP Model:

import gpytorch

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=128
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(), num_tasks=128, rank=1
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)

Model Import:

import torch
import gpytorch
from gp_models import MultitaskGPModel

inputs, targets = fine_tune_data()  # inputs and targets have shape [7200, 128]
# The likelihood was not shown in the original post; a MultitaskGaussianLikelihood is assumed here.
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=128).cuda()
gp_model = MultitaskGPModel(inputs, targets, likelihood).cuda()
# gp_model = torch.nn.DataParallel(gp_model)
gp_model.set_train_data(inputs, targets, strict=False)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, gp_model)
# Adam is assumed here; the original snippet used an unspecified `opt.optimizer`.
gp_optimizer = torch.optim.Adam([
    {'params': gp_model.parameters()},  # includes GaussianLikelihood parameters
], lr=0.02)

Training:

def fine_tune_train():
    gp_model.train()
    likelihood.train()
    n_iter = 50
    for i in range(n_iter):
        gp_optimizer.zero_grad()

        output = gp_model(inputs)
        loss = -mll(output, targets)
        loss.backward(retain_graph=True)
        print('Iter %d/%d - Loss: %.3f' % (i + 1, n_iter, loss.item()))
        gp_optimizer.step()

However, when a prediction is made with gp_model

prediction = likelihood(gp_model(test))  # test has shape [80, 128]

the GPU runs out of memory. Can someone help me identify the memory leak?
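For reference, a minimal sketch of the usual GPyTorch prediction pattern, assuming the gp_model, likelihood, and test variables from the snippets above: it switches to eval mode, disables autograd, and enables fast predictive variances, which reduces (though, as the answer below explains, does not eliminate) the memory needed at prediction time.

gp_model.eval()
likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    prediction = likelihood(gp_model(test))
    mean = prediction.mean                        # shape [80, 128]
    lower, upper = prediction.confidence_region()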

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
Balandat commented, Apr 16, 2019

You have 128 tasks. With the 7200 training points and the 80 test points, your full joint covariance matrix is n x n with n = 128 * (7200 + 80) ≈ 1M. That is the source of your "memory leak".
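A back-of-envelope check of that figure (the 4-byte float32 element size is an assumption, and GPyTorch often avoids materializing this matrix densely, but the scale is the point):

num_tasks, n_train, n_test = 128, 7200, 80
n = num_tasks * (n_train + n_test)        # 931,840 rows and columns
dense_bytes = n * n * 4                   # dense float32 storage
print(f"n = {n:,}; dense covariance ~ {dense_bytes / 1e12:.1f} TB")  # ~3.5 TB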

0 reactions
wanghuzhen commented, Apr 14, 2022

I have the same question. I run a simple multitask GP on an NVIDIA RTX 3090 Ti GPU (24 GB of memory). The input data has shape (3500, 42) and the output data has shape (3500, 3). I train the model on this data, then make predictions with the trained model, and memory is consumed quickly until OOM.

[screenshot] Memory usage during training.

[screenshot] Memory usage during prediction. The memory then stays at 13.2 GB. I tried torch.cuda.empty_cache() to release the memory, but it didn't work; it only releases about half of the memory.
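For context, a minimal sketch of how cached GPU memory is usually released and inspected in PyTorch (the tensor names refer to the prediction code further down; empty_cache() only frees cached blocks, not memory still referenced by live tensors):

import torch

del predictions, mean, lower, upper      # drop references to the prediction outputs first
torch.cuda.empty_cache()                 # return cached blocks to the driver
print(torch.cuda.memory_allocated() / 1e9, "GB still held by live tensors")
print(torch.cuda.memory_reserved() / 1e9, "GB reserved by the caching allocator")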

My code follows.

GP model:

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.num_tasks = train_y.shape[-1]
        self.input_features = train_x.shape[-1]
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=self.num_tasks
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(), num_tasks=self.num_tasks, rank=1
        )
        self.linear1 = torch.nn.Linear(self.input_features, self.input_features)

    def forward(self, x):
        x = self.linear1(x)
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)

Training:

likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(
    num_tasks=3).to(device)
model = MultitaskGPModel(train_x, train_y, likelihood).to(device)
model.double()
likelihood.train()
model.train()
# Use the adam optimizer
# Includes GaussianLikelihood parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(
    likelihood, model)
for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' %
            (i + 1, training_iterations, loss.item()))
    optimizer.step()

Prediction:

model.eval()
likelihood.eval()
# Make predictions
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()

Versions: gpytorch 1.6.0, python 3.9.7, pytorch 1.10.0.
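One possible workaround (not from this thread, just a sketch under the assumption that the joint test covariance is what blows up): predict in chunks, so each forward pass only builds the covariance for a slice of test_x. The chunk size of 256 is arbitrary and tunable.

model.eval()
likelihood.eval()
means, lowers, uppers = [], [], []
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    for chunk in test_x.split(256):              # predict 256 test points at a time
        pred = likelihood(model(chunk))
        means.append(pred.mean)
        lo, up = pred.confidence_region()
        lowers.append(lo)
        uppers.append(up)
mean = torch.cat(means)
lower, upper = torch.cat(lowers), torch.cat(uppers)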

