[Question] Cannot reproduce GPy with GPyTorch for 3D problem
Hi!
I’m currently working on porting a simple GPy model to GPyTorch. It’s for a regression task, in which the input data (X) are coordinates in the range X \in [0, 7] and the targets are values in the range [-158, 3067]. At the moment I’m not interested in extrapolation, only interpolation.
The GPy model I’m using is defined as follows:
import numpy as np
import GPy

# Domain is the author's own type; its definition is not shown in the post.
def build_model(X_data: np.ndarray, y_data: np.ndarray, domain: Domain, mode: int):
    """Return GP model given a set of data and a domain."""
    if mode == 1:
        # RBF kernel plus a bias (constant) kernel
        k1 = GPy.kern.RBF(input_dim=len(domain))
        k2 = GPy.kern.Bias(input_dim=len(domain))
        kernel = k1 + k2
        kernel['rbf.lengthscale'].set_prior(GPy.priors.Gamma(a=1, b=2))
    else:
        raise ValueError(f'Unknown mode: {mode}')
    # Targets are standardized internally by the normalizer
    model = GPy.models.GPRegression(X_data, y_data, kernel, normalizer=GPy.util.normalizer.Standardize())
    _ = model.optimize_restarts(verbose=True)
    return model
This yields the following model
and the predicted function values in the range [2.5,7] (since the function grows rapidly as it approaches zero) like this:
For the GPyTorch model, I’ve followed the ExactGP regression tutorial. I’ve modified the model slightly to perform z-scaling of the targets, due to the large range of my y values (I do it for the X positions as well, but that doesn’t seem to have an effect). Here is my model
import numpy as np
import torch
import gpytorch
import matplotlib.pyplot as plt
class SimpleGPyTorch(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, mode, standardize):
        super(SimpleGPyTorch, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Single RBF kernel, no bias kernel
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        # self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=4, ard_num_dims=3)
        self.mode = mode  # either 1 for CPU or 2 for GPU
        self.standardize = standardize
        if standardize:
            # Save mean/std of the training data for standardization
            self.y_mean = train_y.mean()
            self.y_std = train_y.std()
            self.x_mean = train_x.mean(dim=0, keepdim=True)
            self.x_std = train_x.std(dim=0, keepdim=True)

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)

    def optimize(self, training_iter=125, verbose=False, plot=True):
        model = self
        likelihood = self.likelihood
        X = self.train_inputs[0]
        y = self.train_targets
        # Put model and likelihood in training mode
        model.train()
        likelihood.train()
        optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
        # optimizer = FullBatchLBFGS(model.parameters(), lr=0.01)
        mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
        loss_trace = []

        def closure():
            output = model(X)
            loss = -mll(output, y)  # reach MLE through gradient descent
            return loss

        with gpytorch.settings.cg_tolerance(0.01), gpytorch.settings.max_cg_iterations(10000), gpytorch.settings.max_preconditioner_size(100):
            for i in range(training_iter):
                # Set gradients from previous iteration to 0
                optimizer.zero_grad()
                loss = closure()
                loss.backward()
                optimizer.step()
                loss_trace.append(loss.item())
                # options = {'closure': closure, 'current_loss': loss, 'max_ls': 20}
                # loss, _, _, _, _, _, _, fail = optimizer.step(options)
                if verbose and i%50 == 0:
                    print(f'Iteration {i} - Loss: {loss.item():.3f} - Lengthscale: {model.covar_module.base_kernel.lengthscale.item():.3f} - Noise: {model.likelihood.noise.item():.3f}')
                # if fail:
                #     break
        if plot:
            _, ax = plt.subplots(figsize=(8,6))
            ax.plot(loss_trace)
            ax.set_xlabel("Training iteration")
            ax.set_ylabel("Marginal Log Likelihood Loss")
        return loss_trace

    def standardize_training_data(self):
        # NOTE: these are in-place operations, so they also modify the
        # tensors that were passed to the constructor.
        # Standardize targets
        self.train_targets -= self.y_mean
        self.train_targets /= self.y_std
        # Standardize features
        train_x = self.train_inputs[0]
        train_x -= self.x_mean
        train_x /= self.x_std
        self.train_inputs = (train_x,)

    def predict(self, x):
        # Switch to evaluation (posterior) mode before predicting
        self.eval()
        self.likelihood.eval()
        with torch.no_grad(), gpytorch.settings.fast_pred_var():
            x_pred = torch.from_numpy(x).type(torch.FloatTensor)
            if self.standardize:
                # Standardize prediction features
                x_pred = (x_pred - self.x_mean) / self.x_std
            prediction = self.likelihood(self(x_pred))
            mean = prediction.mean.detach().numpy()
            var = prediction.variance.detach().numpy()
            if self.standardize:
                # Rescale prediction to original training data scale
                original_mean = self.y_mean.detach().numpy()
                original_std = self.y_std.detach().numpy()
                mean = mean*original_std + original_mean
                var = var*original_std**2  # Variance is stationary and is only changed by a factor - https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/gaussian_process/_gpr.py#L355
        return mean, var
which yields the following training curve and model parameters.
and the predicted y values:
I’ve tried different optimizers (Adam, SGD, LBFGS from PyTorch as well as PyTorch-LBFGS), learning rates, using a gamma prior on the lengthscale, changing the kernel (to the SpectralMixtureKernel), increasing the accuracy of the CG solves and the preconditioner size. I’ve also tried restarting training to avoid getting stuck in local minima, but in those cases I’ve found that the model sometimes converges to predicting a flat function.
My guess is that the poor performance of the GPyTorch model is due to the model not converging very well during training. I see much better results when capping the range of y values to, for instance, [0, 50] instead of the full range, but this is not necessary for GPy.
As a sanity check, I’ve tested these exact models on both a toy problem (a noisy sine) and a simpler version of the problem above (2D positions instead of 3D), in which GPy and GPyTorch agree almost perfectly. For the 3D problem, I’ve also tried implementing the same model using scikit-learn, and there I get results very similar to GPy’s. Is there something I’m missing?
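The z-scaling and the back-transformation in predict can be checked in isolation. The sketch below uses plain Python and made-up target values (the names standardize and unstandardize are hypothetical, not part of GPy or GPyTorch); it only illustrates that the mean is shifted and scaled while the variance is scaled by the squared standard deviation.

```python
import statistics


def standardize(values):
    """Return z-scored values together with the mean and std that were used."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std (n - 1), like torch.Tensor.std()
    return [(v - mean) / std for v in values], mean, std


def unstandardize(pred_mean, pred_var, y_mean, y_std):
    """Map a predictive mean/variance from z-scored units back to original units."""
    return pred_mean * y_std + y_mean, pred_var * y_std ** 2


# Made-up targets spanning roughly the range mentioned in the post
y = [-158.0, 12.5, 40.0, 310.0, 3067.0]
z, y_mean, y_std = standardize(y)

# Round trip: mapping each standardized target back recovers the original value
recovered = [unstandardize(v, 0.0, y_mean, y_std)[0] for v in z]

# A unit variance in standardized units corresponds to y_std**2 in original units
_, var_original = unstandardize(0.0, 1.0, y_mean, y_std)
```

If the round trip does not recover the original targets, the statistics used for scaling and for rescaling are not the same, which is worth ruling out before comparing against GPy.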
Top GitHub Comments
I think I managed to get it working. Using BoTorch, setting a positive constraint on the noise, and increasing the max Cholesky size to be larger than the size of my dataset leads to results very similar to GPy’s. The hyperparameters are different from GPy’s, but the predicted model is very similar:
I had also made an error when calculating the MAE and RMSE earlier, so they were closer before than what they seem. Now GPy and GPyTorch are almost identical.
The problem with the model predicting a flat function was also an error that I had made: I had forgotten to copy my input tensors to the model. Since I perform z-scaling of both input data and targets on model creation, they both went to zero after a few models had been created (the same tensors were erroneously used for all models).
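One way this kind of aliasing can play out is sketched below in plain Python. ScaledModel is a hypothetical stand-in, not the class from the question: it stores the target statistics and then z-scores the targets in place, so a second model built from the same list sees data that is already scaled.

```python
import statistics


class ScaledModel:
    """Hypothetical stand-in: saves target statistics, then z-scores targets in place."""

    def __init__(self, targets):
        self.targets = targets  # keeps a reference, not a copy
        self.y_mean = statistics.fmean(targets)
        self.y_std = statistics.stdev(targets)
        for i, t in enumerate(targets):  # in place, like `train_targets -= y_mean`
            targets[i] = (t - self.y_mean) / self.y_std

    def to_original_scale(self, z):
        return z * self.y_std + self.y_mean


y = [-158.0, 12.5, 40.0, 310.0, 3067.0]  # made-up targets
original = list(y)

first = ScaledModel(y)   # y itself is now z-scored
second = ScaledModel(y)  # sees already-scaled data: mean ~ 0, std ~ 1

# Fix: give every model its own copy (for torch tensors, tensor.clone())
third = ScaledModel(list(original))
```

Here second.to_original_scale leaves values essentially unchanged, so its predictions stay in standardized units, while third rescales correctly because it was built from an untouched copy.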
@Balandat, @wjmaddox and @jacobrgardner, thank you for taking the time to help me investigate my problem! And thank you for a very nice and well-documented Gaussian process package!
@elindgren the lower MLL could be expected. GPy technically uses a noise lower bound of 1e-6 in the form of adding jitter when computing cholesky: https://github.com/SheffieldML/GPy/blob/3e19a85575687e37fd6f61174115d7c94d2c96e6/GPy/util/linalg.py#L65
Removing the lower bound in GPyTorch entirely will let you get potentially much smaller noises than 1e-6 (and therefore potentially lower losses), though at the potential cost of numerical instability as the conditioning of the kernel matrix gets worse. The NumericalWarnings with jitter of 1e-8 don’t seem too bad though: 1e-8 is still a really small amount of jitter.
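The effect of a noise (or jitter) floor on conditioning can be seen on a tiny example. The following is a plain-Python sketch, not GPy's or GPyTorch's actual implementation: a hand-written Cholesky factorization fails on a noise-free RBF kernel matrix with a duplicated input, and succeeds once 1e-6 is added to the diagonal.

```python
import math


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel for scalar inputs."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(matrix):
    """Plain Cholesky factorization; raises ValueError on a non-positive pivot."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    raise ValueError(f"non-positive pivot {pivot!r} at row {i}")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def kernel_matrix(xs, noise):
    """RBF kernel matrix with `noise` added to the diagonal."""
    n = len(xs)
    return [
        [rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


xs = [0.0, 0.0, 1.0]  # the duplicated input makes the noise-free matrix singular

try:
    cholesky(kernel_matrix(xs, noise=0.0))
    failed_without_noise = False
except ValueError:
    failed_without_noise = True

# A diagonal term of 1e-6 is enough for the factorization to go through
noisy = kernel_matrix(xs, noise=1e-6)
factor = cholesky(noisy)
```

In practice the inputs are rarely exact duplicates, but nearby points with a long lengthscale produce nearly the same situation, which is why a smaller floor trades a potentially lower loss for a higher risk of numerical warnings.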
If you want the warnings to go away, you could probably run everything in fp64 instead of torch’s default fp32 by adding
at the top of your script.
Alternatively, if you want to set the same lower bound, you can use a
constraint.
It’s not surprising that the lengthscale gets smaller as the noise decreases: smaller lengthscales cause the GP fit to interpolate the training data more tightly, which makes sense in a lower noise regime.