[Question] Cannot reproduce GPy with GPyTorch for 3D problem
Hi!
I’m currently working on porting a simple GPy model to GPyTorch. It’s for a regression task in which the inputs X are 3D coordinates with components in the range [0, 7] and the targets are values in the range [-158, 3067]. At the moment I’m not interested in extrapolation, only interpolation.
The GPy model I’m using is defined as follows:
```python
def build_model(X_data: np.ndarray, y_data: np.ndarray, domain: Domain, mode: int):
    """Return a GP model given a set of data and a domain."""
    if mode == 1:
        k1 = GPy.kern.RBF(input_dim=len(domain))
        k2 = GPy.kern.Bias(input_dim=len(domain))
        kernel = k1 + k2
        kernel['rbf.lengthscale'].set_prior(GPy.priors.Gamma(a=1, b=2))
    else:
        raise ValueError(f'Unknown mode: {mode}')
    model = GPy.models.GPRegression(X_data, y_data, kernel,
                                    normalizer=GPy.util.normalizer.Standardize())
    model.optimize()
    _ = model.optimize_restarts(verbose=True)
    return model
```
This yields the following model, with predicted function values in the range [2.5, 7] (since the function grows rapidly as it approaches zero, I restrict the plot to this region). [Hyperparameter table and prediction plot from the original issue omitted.]
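Those predictions come from GPy's standard predict call, roughly like this (`X_test` is a stand-in name for my grid of query points):

```python
# X_test: (n_points, 3) array of query coordinates (stand-in name)
mean, var = model.predict(X_test)  # posterior mean and variance, both (n_points, 1)
```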
For the GPyTorch model, I’ve followed the ExactGP regression tutorial. I’ve modified the model slightly to perform z-scaling of the targets, due to the large range of my y values (I do it for the X positions as well, but that doesn’t seem to have an effect). Here is my model:
```python
import numpy as np
import torch
import gpytorch
import matplotlib.pyplot as plt


class SimpleGPyTorch(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, mode, standardize):
        super(SimpleGPyTorch, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Single RBF kernel, no bias kernel
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
        # self.covar_module = gpytorch.kernels.SpectralMixtureKernel(num_mixtures=4, ard_num_dims=3)
        self.mode = mode  # either 1 for CPU or 2 for GPU
        self.standardize = standardize
        if standardize:
            # Save mean/std of inputs and targets for standardization
            self.y_mean = train_y.mean()
            self.y_std = train_y.std()
            self.x_mean = train_x.mean(dim=0, keepdim=True)
            self.x_std = train_x.std(dim=0, keepdim=True)
            self.standardize_training_data()

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)

    def optimize(self, training_iter=125, verbose=False, plot=True):
        model = self
        likelihood = self.likelihood
        X = self.train_inputs[0]
        y = self.train_targets
        model.train()
        likelihood.train()
        optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
        # optimizer = FullBatchLBFGS(model.parameters(), lr=0.01)
        mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
        loss_trace = []

        def closure():
            # Set gradients from the previous iteration to 0
            optimizer.zero_grad()
            output = model(X)
            loss = -mll(output, y)  # reach the MLE through gradient descent
            return loss

        # The original code opened cg_tolerance twice; the second call, with
        # 10000, was presumably meant to be max_cg_iterations.
        with gpytorch.settings.cg_tolerance(0.01), \
                gpytorch.settings.max_cg_iterations(10000), \
                gpytorch.settings.max_preconditioner_size(100):
            for i in range(training_iter):
                loss = closure()
                loss.backward()
                optimizer.step()
                # options = {'closure': closure, 'current_loss': loss, 'max_ls': 20}
                # loss, _, _, _, _, _, _, fail = optimizer.step(options)
                if verbose and i % 50 == 0:
                    print(f'Iteration {i} - Loss: {loss.item():.3f} '
                          f'- Lengthscale: {model.covar_module.base_kernel.lengthscale.item():.3f} '
                          f'- Noise: {model.likelihood.noise.item():.3f}')
                loss_trace.append(loss.detach().numpy())
                # if fail:
                #     break
        if plot:
            _, ax = plt.subplots(figsize=(8, 6))
            ax.set_xlabel("Training iteration")
            ax.set_ylabel("Marginal Log Likelihood Loss")
            ax.plot(loss_trace)
        return loss_trace

    def standardize_training_data(self):
        # Standardize targets (in place, so this mutates the tensor passed to __init__)
        self.train_targets -= self.y_mean
        self.train_targets /= self.y_std
        # Standardize features (also in place)
        train_x = self.train_inputs[0]
        train_x -= self.x_mean
        train_x /= self.x_std
        self.train_inputs = (train_x,)

    def predict(self, x):
        self.eval()
        self.likelihood.eval()
        with torch.no_grad(), gpytorch.settings.fast_pred_var():
            x_pred = torch.from_numpy(x).type(torch.FloatTensor)
            if self.standardize:
                # Standardize prediction features
                x_pred = (x_pred - self.x_mean) / self.x_std
            prediction = self.likelihood(self(x_pred))
            mean = prediction.mean.detach().numpy()
            var = prediction.variance.detach().numpy()
        if self.standardize:
            # Rescale predictions to the original training-data scale; the variance
            # only picks up a factor of std**2, cf.
            # https://github.com/scikit-learn/scikit-learn/blob/2beed55847ee70d363bdbfe14ee4401438fba057/sklearn/gaussian_process/_gpr.py#L355
            original_mean = self.y_mean.detach().numpy()
            original_std = self.y_std.detach().numpy()
            mean = mean * original_std + original_mean
            var = var * original_std ** 2
        return mean, var
```
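For completeness, this is roughly how I construct and train the model (the data below is a random stand-in for my real coordinates and targets):

```python
# Stand-in data: 200 random 3D coordinates in [0, 7] with large-range targets
train_x = torch.rand(200, 3) * 7.0
train_y = torch.randn(200) * 800.0 + 1000.0

likelihood = gpytorch.likelihoods.GaussianLikelihood()
# clone() so the in-place z-scaling doesn't mutate the original tensors
model = SimpleGPyTorch(train_x.clone(), train_y.clone(), likelihood,
                       mode=1, standardize=True)
model.optimize(training_iter=125, verbose=True, plot=False)

X_test = np.random.rand(50, 3) * 7.0  # query points as a NumPy array
mean, var = model.predict(X_test)
```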
which yields the following training curve, model parameters, and predicted y values. [Training-curve and prediction plots from the original issue omitted.]
I’ve tried different optimizers (Adam, SGD, LBFGS from PyTorch as well as PyTorch-LBFGS), different learning rates, using a gamma prior on the lengthscale, changing the kernel (to the SpectralMixture kernel), increasing the accuracy of the CG solves, and increasing the preconditioner size. I’ve also tried restarting training to avoid getting stuck in local minima, but in those cases I’ve found that the model sometimes converges to predicting a flat function.
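For reference, the gamma lengthscale prior (mirroring the GPy prior above) was attached roughly like this, as a variant of the `covar_module` line in `__init__`:

```python
# Variant of covar_module with a Gamma(1, 2) prior on the RBF lengthscale
self.covar_module = gpytorch.kernels.ScaleKernel(
    gpytorch.kernels.RBFKernel(
        lengthscale_prior=gpytorch.priors.GammaPrior(concentration=1.0, rate=2.0)
    )
)
```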
My guess is that the poor performance of the GPyTorch model is due to the model not converging well during training. I see much better results when capping the range of y values to, for instance, [0, 50] instead of using the full range, but this is not necessary in GPy.
As a sanity check, I’ve tested these exact models on both a toy problem (a noisy sine) and a simpler version of the problem above (2D positions instead of 3D), where GPy and GPyTorch agree almost perfectly. For the 3D problem, I’ve also implemented the same model in scikit-learn, and there I get results very similar to GPy’s (a sketch of that model is below). Is there something I’m missing?
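The scikit-learn model was along these lines (a sketch, not the exact script; the kernel mirrors the GPy RBF + bias setup):

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# RBF + constant (bias) term with learned noise, targets normalized as in GPy
kernel = ConstantKernel() * RBF(length_scale=1.0) + ConstantKernel() + WhiteKernel()
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=5)
gpr.fit(X_data, y_data)  # X_data, y_data as in build_model above
mean, std = gpr.predict(X_test, return_std=True)
```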
Top GitHub Comments
I think I managed to get it working. Using botorch in combination with setting a positive constraint on the noise and increasing the max Cholesky size to be larger than my dataset leads to results very similar to GPy’s. The hyperparameters differ from GPy’s, but the predicted model is very similar. [Prediction plot from the original comment omitted.]
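Concretely, the fix looked roughly like this (exact values are from memory; `n_train` is the number of training points):

```python
from gpytorch.constraints import GreaterThan

likelihood = gpytorch.likelihoods.GaussianLikelihood(
    noise_constraint=GreaterThan(1e-4)  # positive lower bound on the noise
)
model = SimpleGPyTorch(train_x.clone(), train_y.clone(), likelihood,
                       mode=1, standardize=True)
# Force exact Cholesky solves for kernel matrices up to the full dataset size
with gpytorch.settings.max_cholesky_size(n_train + 1):
    model.optimize(verbose=True, plot=False)
```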
I had also made an error when calculating the MAE and RMSE earlier, so the two models were closer than they appeared. Now GPy and GPyTorch are almost identical.
The model predicting a flat function was also an error on my part: I had forgotten to copy my input tensors before passing them to the model, and since I z-scale both inputs and targets in place on model creation, both went to zero after a few models had been created (the same tensors were erroneously shared by all models).
@Balandat, @wjmaddox and @jacobrgardner, thank you for taking the time to help me investigate my problem! And thank you for a very nice and well-documented Gaussian process package!
@elindgren the lower MLL could be expected. GPy technically uses a noise lower bound of 1e-6, in the form of jitter added when computing the Cholesky decomposition: https://github.com/SheffieldML/GPy/blob/3e19a85575687e37fd6f61174115d7c94d2c96e6/GPy/util/linalg.py#L65
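The idea behind that jitter, as a sketch of the general pattern (not GPy’s exact code):

```python
import numpy as np

def jitter_cholesky(K, max_tries=5):
    # Retry the Cholesky factorization with increasing diagonal jitter
    jitter = 1e-6 * np.mean(np.diag(K))
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
        except np.linalg.LinAlgError:
            jitter *= 10
    raise np.linalg.LinAlgError("Matrix not positive definite, even with jitter")
```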
Removing the lower bound in GPyTorch entirely lets you reach noise values much smaller than 1e-6 (and therefore potentially lower losses), though at the potential cost of numerical instability as the conditioning of the kernel matrix worsens. The NumericalWarnings with a jitter of 1e-8 don’t seem too bad, though: 1e-8 is still a really small amount of jitter.
If you want the warnings to go away, you could probably run everything in fp64 instead of torch’s default fp32 by adding

```python
torch.set_default_dtype(torch.float64)
```

at the top of your script. Alternatively, if you want to set the same lower bound, you can use a `GreaterThan(1e-6)` constraint. It’s not surprising that the lengthscale gets smaller as the noise decreases: smaller lengthscales cause the GP fit to interpolate the training data more tightly, which makes sense in a lower-noise regime.
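Spelled out, that constraint would look like this (a sketch against the model class above):

```python
from gpytorch.constraints import GreaterThan

# Match GPy's effective 1e-6 noise floor
likelihood = gpytorch.likelihoods.GaussianLikelihood(
    noise_constraint=GreaterThan(1e-6)
)
```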