[Bug] Gradient computation with KeOps kernel only works with low number of training data, fails otherwise
🐛 Bug
Hi,
First of all, I would like to extend my thanks to all the developers for all the efforts you have put into both the research and this great package.
Consider a GP with a KeOps kernel (e.g. gpytorch.kernels.keops.RBFKernel). If I train it on N=100 points, the gradient of the predictive mean with respect to the test inputs can be obtained with torch.autograd.grad or .backward(). With N=500, however, an error is thrown saying that the input tensor was not used in the graph. I have tested the script on two separate machines and a Colab instance. The standard GPyTorch kernels do not run into this issue. I spent a good deal of time narrowing this down from larger chunks of code, and this appears to be the cause: the gradient link seems to be cut between the covariance output of the KeOps kernel and the input (covar.x1).
I have provided a minimal example right below that should quickly give you an idea of this somewhat strange behavior. It contains two test cases, N=100 (which passes) and N=500 (which fails). The code is taken from the GPyTorch regression examples; I have only changed the GP kernel and added a few lines at the end to compute gradients.
To reproduce
import math
import torch
import gpytorch
import time
# We will use the simplest form of GP model, exact inference with gpytorch.kernels.keops.RBFKernel
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.keops.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
def train_and_eval_GP(N=100):
    """
    inputs:
        N (int): Number of training points
    """
    # make train/val/test
    # Training data is N points in [0,1] inclusive, regularly spaced
    train_x = torch.linspace(0, 1, N)
    # True function is sin(2*pi*x) with Gaussian noise
    train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()) * math.sqrt(0.04)

    # normalize features
    mean = train_x.mean()
    std = train_x.std() + 1e-6  # prevent dividing by 0
    train_x = (train_x - mean) / std

    # normalize labels
    mean, std = train_y.mean(), train_y.std()
    train_y = (train_y - mean) / std

    # make contiguous
    train_x, train_y = train_x.contiguous(), train_y.contiguous()

    output_device = torch.device('cuda:0')
    train_x, train_y = train_x.to(output_device), train_y.to(output_device)

    # initialize likelihood and model
    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
    model = ExactGPModel(train_x, train_y, likelihood).to(output_device)

    # Find optimal model hyperparameters
    model.train()
    likelihood.train()

    # Use the adam optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes GaussianLikelihood parameters

    # "Loss" for GPs - the marginal log likelihood
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

    training_iter = 20
    for i in range(training_iter):
        # Zero gradients from previous iteration
        optimizer.zero_grad()
        # Output from model
        output = model(train_x)
        # Calc loss and backprop gradients
        loss = -mll(output, train_y)
        loss.backward()
        optimizer.step()
    print('GP model trained.')

    # Get into evaluation (predictive posterior) mode
    model.eval()
    likelihood.eval()

    # Test points are regularly spaced along [0,1]
    test_x = torch.linspace(0, 1, 51, requires_grad=True).to(output_device).contiguous()

    # Make predictions by feeding model through likelihood
    with gpytorch.settings.fast_pred_var():
        observed_pred = likelihood(model(test_x))

    # Gradient of the predictive mean with respect to the test inputs
    assert torch.autograd.grad(observed_pred.mean.sum(), test_x, retain_graph=True) is not None
    print('gradient exists:')
    print(torch.autograd.grad(observed_pred.mean.sum(), test_x, retain_graph=True))
if __name__ == "__main__":
    Ns = [100, 500]  # test cases
    for n in Ns:
        try:
            print(f'testing with {n} points...')
            train_and_eval_GP(N=n)
            print('success!')
        except Exception as e:
            print('failed.')
            print(e)
Stack trace/error message
testing with 100 points...
GP model trained.
gradient exists:
(tensor([-2.6629e+00, -2.6507e+00, -2.6344e+00, -2.6138e+00, -2.5891e+00,
...,
9.1844e-01], device='cuda:0'),)
success!
testing with 500 points...
GP model trained.
failed.
One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
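For context, this is the message autograd raises whenever the tensor being differentiated is no longer connected to the output. The following is a minimal plain-PyTorch sketch with no GPyTorch or KeOps involved. The scalar `w` and the use of `detach()` are illustrative assumptions, not a claim about what the KeOps kernel does internally. It reproduces the same error and shows how `allow_unused=True` can be used to probe for a severed link:

```python
import torch

x = torch.linspace(0.0, 1.0, 5, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)  # stand-in for a kernel hyperparameter

# Connected graph: the output depends on x, so the gradient is defined.
connected = (w * torch.sin(x)).sum()
(grad_connected,) = torch.autograd.grad(connected, x)


def severed_output():
    # detach() cuts the path from x to the output. The output still
    # requires grad through w, so a backward pass remains possible.
    return (w * torch.sin(x.detach())).sum()


# Asking for d(output)/dx on the severed graph raises a RuntimeError.
try:
    torch.autograd.grad(severed_output(), x)
    message = None
except RuntimeError as err:
    message = str(err)
print(message)

# With allow_unused=True the call succeeds and returns None for x,
# which makes it a cheap probe for "is x still part of the graph?".
(grad_unused,) = torch.autograd.grad(severed_output(), x, allow_unused=True)
print(grad_unused)
```

Applied to the reproduction script, the same probe (passing `allow_unused=True` and checking for `None`) would confirm that `test_x` has dropped out of the graph without aborting the run.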
Expected Behavior
Setting the kernel to the standard, non-KeOps kernel (gpytorch.kernels.RBFKernel), we get the gradients in the second case as well, as shown below. However, I can't simply use it, since I'm working on a larger dataset that would run out of memory with it.
testing with 100 points...
GP model trained.
gradient exists:
(tensor([-2.5885, -2.5870, -2.5819, -2.5732, -2.5609, -2.5449, -2.5254, -2.5022,
...,
0.6977, 0.7880, 0.8762], device='cuda:0'),)
success!
testing with 500 points...
GP model trained.
gradient exists:
(tensor([-2.4751, -2.4741, -2.4698, -2.4622, -2.4514, -2.4373, -2.4199, -2.3993,
...,
0.4727, 0.5600, 0.6462], device='cuda:0'),)
success!
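As an independent cross-check of what a correct gradient looks like, the predictive-mean gradient of an RBF GP can also be computed densely in plain PyTorch and compared against the closed-form derivative. This is only a sketch under stated assumptions: the helper `rbf`, the fixed lengthscale and noise values, and the noise-free targets are all hypothetical and are not the hyperparameters learned in the runs above, so the numbers will not match the printed tensors.

```python
import math
import torch


def rbf(a, b, lengthscale):
    # Dense RBF kernel matrix between 1-D input vectors a (n,) and b (m,).
    diff = a.unsqueeze(-1) - b.unsqueeze(-2)
    return torch.exp(-0.5 * (diff / lengthscale) ** 2)


# Hypothetical fixed hyperparameters (assumptions, not learned values).
lengthscale, noise = 0.3, 0.04

train_x = torch.linspace(0, 1, 50, dtype=torch.float64)
train_y = torch.sin(2 * math.pi * train_x)
test_x = torch.linspace(0, 1, 11, dtype=torch.float64, requires_grad=True)

# alpha = (K + noise * I)^{-1} y
K = rbf(train_x, train_x, lengthscale) + noise * torch.eye(50, dtype=torch.float64)
alpha = torch.linalg.solve(K, train_y)

# Predictive mean at the test points and its gradient via autograd.
K_star = rbf(test_x, train_x, lengthscale)  # shape (11, 50)
mean = K_star @ alpha
(autograd_grad,) = torch.autograd.grad(mean.sum(), test_x)

# Closed form: d mean / d x* = sum_i -(x* - x_i) / l^2 * k(x*, x_i) * alpha_i
diff = test_x.detach().unsqueeze(-1) - train_x.unsqueeze(-2)
analytic_grad = (-(diff / lengthscale**2) * K_star.detach()) @ alpha

print(torch.allclose(autograd_grad, analytic_grad))
```

Because each predictive mean value depends only on its own test point, differentiating `mean.sum()` yields the per-point derivative, which is the same quantity the reproduction script requests.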
System information
Please complete the following information:
- GPyTorch 1.6.0
- PyTorch 1.10.1
- KeOps 1.5
- Ubuntu 20.04.3 LTS
Additional context
I know that the issue might be unrelated to GPyTorch itself, as it clearly stems from the KeOps kernel. It is, however, difficult to track down, so I thought I'd report it here.
Issue Analytics
- State:
- Created 2 years ago
- Comments:12
Top GitHub Comments
So, this snippet should work:
I haven't quite figured out what the specific problem in https://github.com/cornellius-gp/gpytorch/blob/5f54dbf1fb514fa5bf31d2468f1d85b2fb66a44f/gpytorch/kernels/keops/rbf_kernel.py#L23 is, but it is related to L34-36 below it, where KeOps is (properly) not used when at least one of the kernel inputs is beneath the max Cholesky size. Enforcing a KeOps forward pass does solve the issue and allows gradients to propagate.
I'll put up a PR once I figure out what's getting detached.
edit: I'm pretty sure the reason why eager_kernel_size also resolves the gradient issues is that it enforces that KeOps is not used at all, as you probably don't want to be using KeOps on small datasets due to speed issues.
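To make the size rule described in this comment concrete, here is a minimal pure-Python sketch. `takes_keops_path` is a hypothetical helper written for illustration, not GPyTorch code, and the threshold of 800 is an assumption taken from the value quoted elsewhere in this thread.

```python
MAX_CHOLESKY_SIZE = 800  # assumed threshold, taken from the value quoted in this thread


def takes_keops_path(n_x1, n_x2, max_cholesky_size=MAX_CHOLESKY_SIZE):
    """Hypothetical mirror of the rule described in the comment: KeOps is
    skipped when at least one kernel input has fewer rows than the max
    Cholesky size, so it is used only when both inputs reach the threshold."""
    return n_x1 >= max_cholesky_size and n_x2 >= max_cholesky_size


# (rows of x1, rows of x2) pairs; 51 is the number of test points in the report.
cases = {
    "N=100 train vs. 51 test": (100, 51),
    "N=500 train vs. 51 test": (500, 51),
    "N=1000 train vs. N=1000 train": (1000, 1000),
}
for name, (n1, n2) in cases.items():
    print(f"{name}: KeOps path = {takes_keops_path(n1, n2)}")
```

Under this rule alone, both training sizes from the report (100 and 500) fall below the threshold, so the threshold by itself would not separate the passing case from the failing one. That is consistent with the remark above that the specific cause had not yet been pinned down.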
Sorry for losing track of this in my ICML crunch, but I took a look at it today and figured out that it doesn't occur when the scale kernel is not included, e.g.
covar_module = gpytorch.kernels.keops.RBFKernel()
so hopefully that's progress towards figuring out what's going on. I'll keep looking tomorrow.
edit: the unscaled covariance also fails, but with a clear error message, and that failure goes away when
n_train > 800
or, more generally, when n_train exceeds the max Cholesky size threshold.