slow predictive posterior eval with `keops` + `fast_pred_var()`

The posterior predictive evaluation with a keops kernel and fast_pred_var() is very slow. What am I doing wrong?

To reproduce

Code snippet to reproduce (e.g. using Google Colab, GPU runtime):

!pip install pykeops
!pip install gpytorch

import torch
import gpytorch
import pykeops

import time
import numpy as np

pykeops.verbose = True
pykeops.clean_pykeops()
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')  # I'm using the GPU

n_samples = 50000
n_samples_val = 25000
n_features = 10

def make_target(X):
  # y = 1 + sum(x^2 over features) + Gaussian noise (std 0.1)
  y = 1.0 + np.square(X).sum(axis=-1) + 0.1*np.random.randn(X.shape[0])
  return y.astype(np.float32)

# train data
X_train = np.random.randn(n_samples, n_features).astype(np.float32)
y_train = make_target(X_train)

# validation data
X_val = np.random.randn(n_samples_val, n_features).astype(np.float32)
y_val = make_target(X_val)

train_X = torch.as_tensor(X_train, device=device).contiguous()
train_y = torch.as_tensor(y_train, device=device).contiguous()

val_X = torch.as_tensor(X_val, device=device).contiguous()
val_y = torch.as_tensor(y_val, device=device).contiguous()

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.keops.MaternKernel(nu=2.5))
        # self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.MaternKernel(nu=2.5))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood().to(device)
model = ExactGPModel(train_X, train_y, likelihood).to(device)

def eval_fit() -> float:
  # Get into evaluation (predictive posterior) mode
  model.eval()
  likelihood.eval()

  # Predict on the held-out validation inputs (val_X)
  # Make predictions by feeding model through likelihood
  start_time = time.time()
  with torch.no_grad(), gpytorch.settings.fast_pred_var():
    y_pred = likelihood(model(val_X))
    y_pred_mean = y_pred.mean.cpu().numpy()
  print(f'Eval - elapsed time: {(time.time() - start_time):.3f} sec ...')

  return np.sqrt(np.mean(np.square(y_pred_mean - y_val)))

model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(
    [
     {'params': model.parameters()},  # Includes GaussianLikelihood parameters
    ], 
    lr=0.01
)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

training_iter = 5

for i in range(training_iter):
    start_time = time.time()
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_X)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f   time:%.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item(),
        time.time() - start_time
    ))
    optimizer.step()

    eval_fit()
    model.train()
    likelihood.train()

Output

Compiling libKeOpstorch53bd9c5b1e in /root/.cache/pykeops-1.4.1-cpython-36/build-libKeOpstorch53bd9c5b1e:
       formula: Sum_Reduction(((((Var(0,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))) + (IntCst(1) + (Var(3,1,2) * Square(Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1))))))))) * Exp((Var(4,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))))) * Var(5,11,1)),0)
       aliases: Var(0,1,2); Var(1,10,0); Var(2,10,1); Var(3,1,2); Var(4,1,2); Var(5,11,1); 
       dtype  : float32
... Done.
Compiling libKeOpstorch308f0e2d0d in /root/.cache/pykeops-1.4.1-cpython-36/build-libKeOpstorch308f0e2d0d:
       formula: Grad_WithSavedForward(Sum_Reduction(((((Var(0,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))) + (IntCst(1) + (Var(3,1,2) * Square(Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1))))))))) * Exp((Var(4,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))))) * Var(5,11,1)),0), Var(1,10,0), Var(6,11,0), Var(7,11,0))
       aliases: Var(0,1,2); Var(1,10,0); Var(2,10,1); Var(3,1,2); Var(4,1,2); Var(5,11,1); Var(6,11,0); Var(7,11,0); 
       dtype  : float32
... Done.
Compiling libKeOpstorchaeed646587 in /root/.cache/pykeops-1.4.1-cpython-36/build-libKeOpstorchaeed646587:
       formula: Grad_WithSavedForward(Sum_Reduction(((((Var(0,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))) + (IntCst(1) + (Var(3,1,2) * Square(Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1))))))))) * Exp((Var(4,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))))) * Var(5,11,1)),0), Var(2,10,1), Var(6,11,0), Var(7,11,0))
       aliases: Var(0,1,2); Var(1,10,0); Var(2,10,1); Var(3,1,2); Var(4,1,2); Var(5,11,1); Var(6,11,0); Var(7,11,0); 
       dtype  : float32
... Done.
Iter 1/5 - Loss: 20.431   lengthscale: 0.693   noise: 0.693   time:118.198
Compiling libKeOpstorch5855ba4b6c in /root/.cache/pykeops-1.4.1-cpython-36/build-libKeOpstorch5855ba4b6c:
       formula: Sum_Reduction(((((Var(0,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))) + (IntCst(1) + (Var(3,1,2) * Square(Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1))))))))) * Exp((Var(4,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))))) * Var(5,1,1)),0)
       aliases: Var(0,1,2); Var(1,10,0); Var(2,10,1); Var(3,1,2); Var(4,1,2); Var(5,1,1); 
       dtype  : float32
... Done.
Compiling libKeOpstorchebbd8d09bd in /root/.cache/pykeops-1.4.1-cpython-36/build-libKeOpstorchebbd8d09bd:
       formula: Sum_Reduction(((((Var(0,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))) + (IntCst(1) + (Var(3,1,2) * Square(Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1))))))))) * Exp((Var(4,1,2) * Sqrt(Sum(Square((Var(1,10,0) - Var(2,10,1)))))))) * Var(5,100,1)),0)
       aliases: Var(0,1,2); Var(1,10,0); Var(2,10,1); Var(3,1,2); Var(4,1,2); Var(5,100,1); 
       dtype  : float32
... Done.
Eval - elapsed time: 80.717 sec ...
Iter 2/5 - Loss: 19.944   lengthscale: 0.698   noise: 0.698   time:5.715
Eval - elapsed time: 14.651 sec ...
Iter 3/5 - Loss: 19.469   lengthscale: 0.703   noise: 0.703   time:5.717
Eval - elapsed time: 14.667 sec ...
Iter 4/5 - Loss: 19.006   lengthscale: 0.708   noise: 0.708   time:5.695
Eval - elapsed time: 14.709 sec ...
Iter 5/5 - Loss: 18.556   lengthscale: 0.713   noise: 0.713   time:5.707
Eval - elapsed time: 14.928 sec ...

Expected Behavior

Predictive posterior eval should be (much) faster. I would have expected the eval time to be close to the time taken for a single iteration. Instead, a training iteration takes about 6 seconds while the eval takes about 15 seconds (ignoring the 1st iteration, which also includes the KeOps compilation). If I batch the predictions (with a batch size of, say, 4096), then the first batch takes ca. 14 sec, while the others complete in a fraction of a second.
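
The batching mentioned above could look roughly like the sketch below. `batch_slices` is a hypothetical helper, not part of GPyTorch; it only computes index ranges. The commented-out lines show how it might be combined with the `model`, `likelihood`, and `val_X` from the reproduction script.

```python
def batch_slices(n_items, batch_size):
    """Yield slice objects that cover range(n_items) in consecutive chunks."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        # The last chunk may be shorter than batch_size.
        yield slice(start, min(start + batch_size, n_items))


# Index ranges for 25000 validation points in batches of 4096.
slices = list(batch_slices(25000, 4096))
print(len(slices), slices[0], slices[-1])

# Hypothetical use with the script above (not executed here):
#   with torch.no_grad(), gpytorch.settings.fast_pred_var():
#       means = [likelihood(model(val_X[s])).mean
#                for s in batch_slices(len(val_X), 4096)]
#   y_pred_mean = torch.cat(means).cpu().numpy()
```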

System information

GPyTorch Version 1.1.1
PyTorch Version 1.5.1
Computer OS: Linux, CUDA 10.2

Additional context

The best way to reproduce this is to run the entire code in Google Colab (with a GPU runtime).

Top GitHub Comments

gpleiss commented, Sep 1, 2020

> What if I increase the tolerance of the linear solver (to reduce the CG iteration count during the prediction loop) - should I expect a significant deterioration in the predicted means and variances?

There will be some loss in accuracy, but that is of course dependent on the dataset. This is probably a good mechanism for speeding up cross validation though! You can wrap your validation loop with this context manager:

with gpytorch.settings.eval_cg_tolerance(1e-3):
    # Code

1e-3 is the default value. Changing it to 1e-2 might speed things up.

gpleiss commented, Aug 24, 2020

> Predictive posterior eval should be (much) faster. I would have expected the eval time to be close to the time taken for a single iteration.

Not necessarily. GPyTorch sets the tolerance of its iterative methods to be much tighter during prediction than during training. So it is likely that we’re running more CG iterations during the prediction loop.
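
To illustrate why a tighter tolerance costs more iterations, here is a dependency-free sketch of conjugate gradients (CG) on a small tridiagonal system. It is not GPyTorch's solver: the function names, the test matrix, and the stopping rule (relative residual norm below `tol`) are assumptions chosen for illustration only.

```python
import math


def tridiagonal_spd(n):
    """Small symmetric positive-definite test matrix: 2.5 on the diagonal, -1 beside it."""
    return [[2.5 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]


def conjugate_gradients(A, b, tol, max_iter=1000):
    """Solve A x = b for SPD A; stop once ||r|| / ||b|| < tol. Returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    p = list(r)  # first search direction
    rs_old = sum(ri * ri for ri in r)
    b_norm = math.sqrt(rs_old)
    for it in range(1, max_iter + 1):
        Ap = [sum(a * pj for a, pj in zip(row, p)) for row in A]
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if math.sqrt(rs_new) / b_norm < tol:
            return x, it
        # New direction: current residual plus a multiple of the old direction.
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter


A = tridiagonal_spd(50)
b = [1.0] + [0.0] * 49

x_loose, iters_loose = conjugate_gradients(A, b, tol=1e-1)
x_tight, iters_tight = conjugate_gradients(A, b, tol=1e-8)
print(f"tol=1e-1: {iters_loose} iterations, tol=1e-8: {iters_tight} iterations")
```

Each iteration costs one matrix-vector product, so on a large kernel matrix the extra iterations demanded by a tighter tolerance translate directly into longer prediction times.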

> If I batch the predictions (with a batch size of, say, 4096) then the first batch takes ca. 14 sec, while the others complete in a fraction of a second.

Yes. This is because we make caches of the large predictive computations that make subsequent computations much faster. The first time the prediction code is called, the cache is created. However, if the hyperparameters change (i.e. after a training iteration) then the cache is discarded because it has to be recomputed.

See https://arxiv.org/pdf/1903.08114.pdf - Section 3 paragraph “Predictions”
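
The caching behavior described above can be mimicked with a toy class. This is only a sketch of the pattern, not GPyTorch's implementation: `CachedPredictor`, `_expensive_solve`, and `set_lengthscale` are hypothetical names, and the "expensive" step is a placeholder for the large train-data computation.

```python
class CachedPredictor:
    """Toy model that caches an expensive, hyperparameter-dependent computation."""

    def __init__(self, lengthscale):
        self.lengthscale = lengthscale
        self._cache = None
        self.expensive_calls = 0  # how many times the cache had to be built

    def _expensive_solve(self):
        # Placeholder for the costly computation that depends only on the
        # training data and the current hyperparameters.
        self.expensive_calls += 1
        return 1.0 / self.lengthscale

    def predict(self, batch):
        if self._cache is None:  # only the first call after a reset pays the cost
            self._cache = self._expensive_solve()
        return [self._cache * x for x in batch]

    def set_lengthscale(self, value):
        # A hyperparameter update (e.g. an optimizer step) invalidates the cache.
        self.lengthscale = value
        self._cache = None


model_sketch = CachedPredictor(lengthscale=2.0)
outputs = [model_sketch.predict(batch) for batch in ([1.0, 2.0], [3.0], [4.0, 5.0])]
calls_before_update = model_sketch.expensive_calls

model_sketch.set_lengthscale(4.0)
after_update = model_sketch.predict([1.0])
calls_after_update = model_sketch.expensive_calls
print(calls_before_update, calls_after_update)
```

In the reproduction script, `eval_fit()` runs after every optimizer step, so each evaluation starts from an empty cache and pays the full cost again.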
