
[Bug] Gradient computation with KeOps kernel only works with low number of training data, fails otherwise


šŸ› Bug

Hi,

First of all, I would like to extend my thanks to all the developers for all the efforts you have put into both the research and this great package.

Consider a GP with a KeOps kernel (e.g. gpytorch.kernels.keops.RBFKernel). If I train it with N=100 points, the gradient of the predictive mean can be obtained with torch.autograd.grad or .backward(); set N=500 and an error is thrown saying the input tensor was not used in the graph. I have tested the script on two separate machines and a Colab instance. Using the standard GPyTorch kernels does not run into this issue. I spent a good deal of time narrowing this down from larger chunks of code, and this seems to be the root cause: the gradient link appears to be cut between the covariance output of the KeOps kernel and the input (covar.x1).

I have provided a minimal script below that should quickly give you an idea of this somewhat strange behavior. It contains two test cases, N=100 (which passes) and N=500 (which fails). The code is taken from the GPyTorch regression examples; I have only swapped in the KeOps kernel and added a few lines for calculating gradients at the end.

To reproduce

import math
import torch
import gpytorch
import time

# We will use the simplest form of GP model, exact inference with gpytorch.kernels.keops.RBFKernel
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.keops.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

def train_and_eval_GP(N = 100):
    """
    inputs:
    N (int): Number of training points
    """
    # make train/val/test
    # Training data is N points in [0,1] inclusive, regularly spaced
    train_x = torch.linspace(0, 1, N)
    # True function is sin(2*pi*x) with Gaussian noise
    train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * math.sqrt(0.04)
    # normalize features
    mean = train_x.mean()
    std = train_x.std() + 1e-6 # prevent dividing by 0
    train_x = (train_x - mean) / std

    # normalize labels
    mean, std = train_y.mean(),train_y.std()
    train_y = (train_y - mean) / std

    # make contiguous
    train_x, train_y = train_x.contiguous(), train_y.contiguous()

    output_device = torch.device('cuda:0')

    train_x, train_y = train_x.to(output_device), train_y.to(output_device)

    # initialize likelihood and model
    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
    model = ExactGPModel(train_x, train_y, likelihood).to(output_device)

    # Find optimal model hyperparameters
    model.train()
    likelihood.train()

    # Use the adam optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes GaussianLikelihood parameters

    # "Loss" for GPs - the marginal log likelihood
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

    training_iter = 20
    for i in range(training_iter):
        # Zero gradients from previous iteration
        optimizer.zero_grad()
        # Output from model
        output = model(train_x)
        # Calc loss and backprop gradients
        loss = -mll(output, train_y)
        loss.backward()
        optimizer.step()
    print('GP model trained.')

    # Get into evaluation (predictive posterior) mode
    model.eval()
    likelihood.eval()

    # Test points are regularly spaced along [0,1]
    test_x = torch.linspace(0, 1, 51, requires_grad=True).to(output_device).contiguous()
    
    # Make predictions by feeding model through likelihood
    with gpytorch.settings.fast_pred_var():
        observed_pred = likelihood(model(test_x))
        assert torch.autograd.grad(observed_pred.mean.sum(), test_x, retain_graph=True) is not None
        print('gradient exists:')
        print(torch.autograd.grad(observed_pred.mean.sum(), test_x, retain_graph=True))

if __name__ == "__main__":
    Ns = [100, 500] #test cases
    for n in Ns:
        try:
            print(f'testing with {n} points...')
            train_and_eval_GP(N = n) 
            print('success!')
        except Exception as e:
            print('failed.')
            print(e)

Stack trace/error message

testing with 100 points...
GP model trained.
gradient exists:
(tensor([-2.6629e+00, -2.6507e+00, -2.6344e+00, -2.6138e+00, -2.5891e+00,
        ...,
         9.1844e-01], device='cuda:0'),)
success!
testing with 500 points...
GP model trained.
failed.
One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
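
For context, this RuntimeError is PyTorch's generic autograd message for a tensor that never entered the computation graph. A minimal sketch, independent of GPyTorch, that reproduces the same message (variable names are mine):

import torch

# The same RuntimeError appears whenever the tensor handed to autograd.grad
# was never used to compute the output.
x = torch.randn(3, requires_grad=True)
y = torch.randn(3, requires_grad=True)
out = (x * 2).sum()  # y is never used in computing `out`
try:
    torch.autograd.grad(out, y)
except RuntimeError as e:
    print(e)  # "One of the differentiated Tensors appears to not have been used in the graph. ..."
print(torch.autograd.grad(out, y, allow_unused=True))  # (None,) instead of an error

In the failing N=500 case above, test_x apparently ends up in exactly that detached position, even though it is clearly used to compute the predictive mean.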

Expected Behavior

Setting the kernel to the standard, non-KeOps kernel (gpytorch.kernels.RBFKernel), we get the gradients in the second case as well, as shown below. However, I can't simply use it, since I'm working on a larger dataset that will run out of memory if I do so.
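
For reference, the only line that differs from the script above is the kernel definition. A minimal sketch of the swap (variable names are mine):

import gpytorch

# KeOps version used in the repro above (gradient fails at N=500 for me):
keops_covar = gpytorch.kernels.ScaleKernel(gpytorch.kernels.keops.RBFKernel())
# Standard version (gradient works at both N=100 and N=500):
standard_covar = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

With the standard kernel, the output of the same test loop is: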

testing with 100 points...
GP model trained.
gradient exists:
(tensor([-2.5885, -2.5870, -2.5819, -2.5732, -2.5609, -2.5449, -2.5254, -2.5022,
        ...,
         0.6977,  0.7880,  0.8762], device='cuda:0'),)
success!
testing with 500 points...
GP model trained.
gradient exists:
(tensor([-2.4751, -2.4741, -2.4698, -2.4622, -2.4514, -2.4373, -2.4199, -2.3993,
        ...,
         0.4727,  0.5600,  0.6462], device='cuda:0'),)
success!

System information

  • GPyTorch 1.6.0
  • PyTorch 1.10.1
  • KeOps 1.5
  • Ubuntu 20.04.3 LTS

Additional context

I know that the issue might be unrelated to GPyTorch, as it clearly stems from the KeOps kernel. It is, however, difficult to track down, so I thought I'd report it here.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12

Top GitHub Comments

1 reaction
wjmaddox commented, Feb 3, 2022

So, this snippet should work:

test_x = torch.linspace(0, 1, 51, device=output_device).requires_grad_()

# Make predictions by feeding model through likelihood
with gpytorch.settings.max_cholesky_size(50),  gpytorch.settings.fast_pred_var():
    observed_pred = likelihood(model(test_x))
    assert torch.autograd.grad(observed_pred.mean.sum(), test_x, retain_graph=True) is not None
    print('gradient exists:')
    print(torch.autograd.grad(observed_pred.mean.sum(), test_x, retain_graph=True))

I haven't quite figured out what the specific issue in https://github.com/cornellius-gp/gpytorch/blob/5f54dbf1fb514fa5bf31d2468f1d85b2fb66a44f/gpytorch/kernels/keops/rbf_kernel.py#L23 is, but the issue is that below, in L34-36, KeOps is (properly) not used when at least one of the kernel inputs is beneath the max Cholesky size. Enforcing a KeOps forward does solve the issue, allowing gradients to propagate.

I'll put up a PR once I figure out what's getting detached.

edit: I'm pretty sure the reason eager_kernel_size also resolves the gradient issue is that it enforces that KeOps is not used at all, as you probably don't want to be using KeOps on small datasets due to speed issues.

1 reaction
wjmaddox commented, Feb 3, 2022

Sorry for losing track of this in my ICML crunch, but I took a look at it today and figured out that it doesn't occur when the scale kernel is not included, e.g. covar_module = gpytorch.kernels.keops.RBFKernel(), so hopefully that's progress towards figuring out what's going on. I'll keep looking tomorrow.

edit: the unscaled covariance fails as well, but with a clear error message, and the failure goes away when n_train > 800 or above the max Cholesky size threshold.
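
For clarity, the unscaled variant being referred to would look like the sketch below: the same model as in the original report, only with the ScaleKernel wrapper dropped (the class name is mine).

import gpytorch

# Sketch: identical to ExactGPModel from the report, but the KeOps RBF kernel
# is used directly, without ScaleKernel (so no outputscale is learned).
class UnscaledKeOpsGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.keops.RBFKernel()  # no ScaleKernel wrapper

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)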
