
gradients of Kernels / backward functions


Here is the backward function of rbf_covariance from rbf_covariance.py. I am confused about what happens in that function. Could you give me a hand regarding this formula? Thanks!

    @staticmethod
    def backward(ctx, grad_output):
        d_output_d_input = ctx.saved_tensors[0]
        lengthscale_grad = grad_output * d_output_d_input
        return None, None, lengthscale_grad, None
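
For reference: backward receives grad_output, the gradient of the loss with respect to this function's output, and must return one gradient per argument of forward. The tensor cached during forward (d_output_d_input) is the derivative of the kernel output with respect to the lengthscale, so grad_output * d_output_d_input is just the chain rule, and None is returned for the arguments that need no gradient. Below is a minimal, self-contained sketch of the same pattern with a toy one-dimensional kernel; it illustrates the mechanism and is not GPyTorch's actual rbf_covariance code.

import torch

class ToyRBF(torch.autograd.Function):
    """Toy kernel k(x1, x2) = exp(-0.5 * (x1 - x2)^2 / lengthscale^2)."""

    @staticmethod
    def forward(ctx, x1, x2, lengthscale):
        sq_dist = (x1.unsqueeze(-1) - x2.unsqueeze(-2)).pow(2)
        out = torch.exp(-0.5 * sq_dist / lengthscale.pow(2))
        # Analytic derivative of the output w.r.t. the lengthscale, cached for backward
        d_output_d_lengthscale = out * sq_dist / lengthscale.pow(3)
        ctx.save_for_backward(d_output_d_lengthscale)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        d_output_d_lengthscale = ctx.saved_tensors[0]
        # Chain rule: dL/d(lengthscale) = sum_ij dL/dK_ij * dK_ij/d(lengthscale)
        lengthscale_grad = (grad_output * d_output_d_lengthscale).sum()
        # One return value per argument of forward; x1 and x2 get no gradient here
        return None, None, lengthscale_grad

# Gradients flow to the lengthscale when backward() is called on a scalar
x1, x2 = torch.randn(5), torch.randn(5)
lengthscale = torch.tensor(1.5, requires_grad=True)
ToyRBF.apply(x1, x2, lengthscale).sum().backward()
print(lengthscale.grad)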

Second question: how can I use the RBF kernel to get gradients w.r.t. the inputs of the GP (see [1])? That is, the GP itself stays unchanged.

[1] - https://stats.stackexchange.com/questions/373446/computing-gradients-via-gaussian-process-regression

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 14 (1 by maintainers)

Top GitHub Comments

4 reactions
jacobrgardner commented, Jul 6, 2019

@cherepanovic Yes, you can forward propagate through a GP just like you would any PyTorch module, and when you call backward on a scalar you’ll get gradients with respect to any tensors that require grad that were involved in the computation of said scalar. In general, there are lots of great PyTorch tutorials on autograd mechanisms.
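
As a tiny, generic illustration of that autograd mechanism (plain PyTorch, nothing GPyTorch-specific):

import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()   # any scalar computed from x
y.backward()         # populates x.grad with dy/dx = 2 * x
print(x.grad)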

In your specific example, there are a few minor GPyTorch specific issues to keep in mind. First, when you backpropagate through a GP posterior in GPyTorch, you’ll want to be conscious of the fact that we compute caches for test time computation that you’ll want to clear each time through the model (since these caches explicitly assume the parameters aren’t changing). Second, as the output of a GP is a distribution and not a tensor, your arrow going from the GP to the second NN will actually need to be some operation that gives you a tensor, like sampling from the GP posterior.

Here’s a full example that does something like what you want:

import torch
import gpytorch

### Step 1: Define the GP. Here, we assume the GP takes the first NN as input and represents the full NN -> GP part of the model.
class GPRegressionModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, feature_extractor):
        # Batch shape 5 means we actually want 5 GPs in this "layer"
        batch_shape = torch.Size([5])
        
        train_x = train_x.expand(*batch_shape, *train_x.shape)
        train_y = train_y.expand(*batch_shape, *train_y.shape)
        
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)

        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(batch_shape=batch_shape),
            batch_shape=batch_shape,
        )

        self.feature_extractor = feature_extractor

    def forward(self, x):
        x = self.feature_extractor(x)
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

### Step 2: Define the GP -> NN part of the model. Forward here is made slightly complicated by the fact that when training
#   a model like this, we need to be sure to clear the GP test time caches each time.
class GPNN(torch.nn.Module):
    def __init__(self, gp, postprocess_nn):
        super().__init__()
        self.gp = gp
        self.postprocess_nn = postprocess_nn
        
    def forward(self, x):
        if self.training:
            # The next three lines are required to clear the GP test time caches since the GP parameters will change
            # each time
            with gpytorch.settings.detach_test_caches(False):
                self.gp.train()
                self.gp.eval()
                gp_output = self.gp(x)
        else:
            # If we aren't in training mode, we don't expect the GP parameters to change each iteration
            # so we don't need to clear the caches.
            gp_output = self.gp(x)
        
        # f_samples will be 10 x 5 x n in this example
        f_samples = gp_output.rsample(torch.Size([10]))
        
        # Transpose to be 10 x n x 5
        f_samples = f_samples.transpose(-2, -1)
        
        output = self.postprocess_nn(f_samples)
        output = output.mean(0).squeeze(-1)  # Average over GP sample dimension
        
        return output

### Step 3: Plug the pieces together: We define a simple feature extractor, the GP, and the postprocessing NN and combine them.
feature_extractor = torch.nn.Sequential(
    torch.nn.Linear(train_x.size(-1), 10),
    torch.nn.ReLU(),
    torch.nn.Linear(10, 10),
    torch.nn.ReLU(),
)

gp = GPRegressionModel(
    train_x,
    train_y,
    gpytorch.likelihoods.GaussianLikelihood(),
    feature_extractor=feature_extractor
)

postprocess_nn = torch.nn.Sequential(
    torch.nn.Linear(5, 5),   # 5 because the GP layer has 5 GP outputs
    torch.nn.ReLU(),
    torch.nn.Linear(5, 1),
)

model = GPNN(gp, postprocess_nn)


# Compute mean squared error through this model and look at a gradient
mse = (model(train_x) - train_y).pow(2).mean()
mse.backward()
print(feature_extractor[0].weight.grad)
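
If gradients with respect to the inputs themselves are what you're after (as in [1] above), one possible follow-up, assuming the example runs as written, is to mark the input tensor as requiring grad and differentiate the same scalar with respect to it:

# Hypothetical extension of the example above
train_x.requires_grad_(True)
mse = (model(train_x) - train_y).pow(2).mean()
dmse_dx = torch.autograd.grad(mse, train_x)[0]  # same shape as train_x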
2 reactions
KeAWang commented, Jul 3, 2019

Oh, actually, for this special case, since the derivative of the GP at each point depends only on the corresponding test point, you could do something like this instead:

dydtest_x_fast = torch.autograd.grad(observed_pred.mean.sum(), test_x)[0]

Speed difference: [screenshot in the original issue]
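
For context, here is a sketch of how that one-liner typically sits in a GPyTorch evaluation loop. The names model, likelihood, and test_x are assumptions for a trained exact GP, its Gaussian likelihood, and the test inputs; they are not taken from the thread.

import torch

model.eval()
likelihood.eval()

# The test inputs must require grad so autograd can differentiate w.r.t. them
test_x = torch.linspace(0, 1, 51).requires_grad_(True)
observed_pred = likelihood(model(test_x))

# Each posterior-mean entry depends only on its own test point, so summing
# before autograd.grad still yields the per-point derivatives
dydtest_x_fast = torch.autograd.grad(observed_pred.mean.sum(), test_x)[0]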


