
[Bug?] Much larger variance with MultiTask kernel compared to Independent Model List


šŸ› Bug

When I train a MultiTask-kernel model and a collection of independent models tied together in an IndependentModelList object on the same dataset, I see predictive variances that differ by orders of magnitude. It is unclear why this is the case, since the parameters common to the two learned models (the MultiTask model MTmodel and the Independent Model List model) appear to be quite similar.

To reproduce

**Code snippet to reproduce**

import torch
import gpytorch
import numpy as np


class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=4))

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        # https://docs.gpytorch.ai/en/stable/means.html#multitaskmean
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=2
        )
        # Composition of index kernel and RBF kernel
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=4), num_tasks=2, rank=2
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        # https://docs.gpytorch.ai/en/stable/distributions.html#multitaskmultivariatenormal
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x, interleaved=False)


# Number of train samples
nsamps = 1000
# Fix seed for reproducibility
np.random.seed(10)
# Joint input space tensor (u_t, x_t) to hold all inputs and trajectories
train_x = torch.tensor(np.random.uniform(low=-1.0, high=1.0, size=(nsamps, 4))).float()

# Generate output samples
# A and B matrices
A = torch.tensor([[1., 0.],
                  [0., 1.]])
B = torch.tensor([[-0.2, 0.1],
                 [0.15, 0.15]])
# Output states starting time index 1, no observation noise
train_y = torch.zeros(nsamps, 2)
# Obtain the output states $(x_{t+1, 1}, x_{t+1, 2})$
for i in range(nsamps):
    # Get the output next state
    x_next = torch.matmul(A, train_x[i, 2:4]) + torch.matmul(B, train_x[i, 0:2])
    # No observation noise added
    train_y[i, :] = x_next

# dataset = torch.cat([train_x, train_y], dim=1)

likelihood1 = gpytorch.likelihoods.GaussianLikelihood()
model1 = ExactGPModel(train_x, train_y[:, 0], likelihood1)

likelihood2 = gpytorch.likelihoods.GaussianLikelihood()
model2 = ExactGPModel(train_x, train_y[:, 1], likelihood2)

# Collect the sub-models in an IndependentModelList, and the respective likelihoods in a LikelihoodList
model = gpytorch.models.IndependentModelList(model1, model2)
likelihood = gpytorch.likelihoods.LikelihoodList(model1.likelihood, model2.likelihood)

mll = gpytorch.mlls.SumMarginalLogLikelihood(likelihood, model)

# Perform Ind. Model List Training
training_iterations = 50
# Find optimal model hyper-parameters
model.train()
likelihood.train()
# Use the Adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes all submodel and all likelihood parameters
], lr=0.1)
print("Training Ind. Model List\n- - - - - - - - - - ")
for i in range(training_iterations):
    optimizer.zero_grad()
    output = model(*model.train_inputs)
    loss = -mll(output, model.train_targets)
    loss.backward()
    if (i + 1) % 5 == 0:
        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()
print("- - - - - - - - - - ")

# MultitaskGaussianLikelihood allows modelling a full 2x2 noise covariance
MTlikelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
MTmodel = MultitaskGPModel(train_x, train_y, MTlikelihood)

training_iterations = 50

# Find optimal MTmodel hyperparameters
MTmodel.train()
MTlikelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': MTmodel.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(MTlikelihood, MTmodel)
print("Training MT Model\n- - - - - - - - - - ")
for i in range(training_iterations):
    optimizer.zero_grad()
    output = MTmodel(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    if (i + 1) % 5 == 0:
        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    optimizer.step()
print("- - - - - - - - - - ")

# View the parameters (and others specific to MT)-
# (1) Learned value of observation-noise covariance
# (2) Learned constant mean prior
# (3) Learned kernel scale parameter (\sigma)
# (4) Learned kernel length scale (\ell)

# Output 1 y_{a}
print("- - - - - - - - - \nModel 1a\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[0].likelihood.noise_covar.noise)
print("Learned constant mean for the prior")
print(model.models[0].mean_module.constant)
print("Learned kernel scale (variance of kernel sigma)")
print(model.models[0].covar_module.outputscale)
print("Learned kernel length scales (one for each input) \ell")
print(model.models[0].covar_module.base_kernel.lengthscale)

# Output 2 y_{b}
print("- - - - - - - - - \nModel 1b\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[1].likelihood.noise_covar.noise)
print("Learned constant mean for the prior")
print(model.models[1].mean_module.constant)
print("Learned kernel scale (variance of kernel sigma)")
print(model.models[1].covar_module.outputscale)
print("Learned kernel length scales (one for each input) \ell")
print(model.models[1].covar_module.base_kernel.lengthscale)

# MT Model
print("- - - - - - - - - \nModel 2 (MultiTask=MT)\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(MTmodel.likelihood.noise)
print("Learned constant mean for the prior comp. 1")
print(MTmodel.mean_module.base_means[0].constant)
print("Learned constant mean for the prior comp. 2")
print(MTmodel.mean_module.base_means[1].constant)
print("Learned static Index Kernel/Fixed Covariance K_{TT} matrix")
print(MTmodel.covar_module.task_covar_module.covar_factor)
print("Learned kernel length scales (one for each input) \ell")
print(MTmodel.covar_module.data_covar_module.lengthscale)

# Set models into eval mode
model.eval()
MTmodel.eval()
likelihood.eval()
MTlikelihood.eval()

# Shift this distance away from a train data point
shift = 0.05

# Train data-point
ux1 = train_x[1:2, :]
ux2 = train_x[1:2, :] + shift
# Performing inference on the training data points themselves
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    # Get distributions of type multivariate-normal
    prediction_dist1 = likelihood(*model(ux1, ux1))
    prediction_dist2 = likelihood(*model(ux2, ux2))
    # Get distribution of type multitask multivariate normal
    prediction_dist3 = MTlikelihood(MTmodel(ux1))
    prediction_dist4 = MTlikelihood(MTmodel(ux2))


print("Indp Model List Mean and Variance on a Train Point")
print('mean: ', prediction_dist1[0].mean.detach().numpy(), prediction_dist1[1].mean.detach().numpy())
print('vars: ', prediction_dist1[0].covariance_matrix.detach().numpy(), prediction_dist1[1].covariance_matrix.detach().numpy())
print('------')
print("Indp Model List Mean and Variance Nearby a Train Point")
print('mean: ', prediction_dist2[0].mean.detach().numpy(), prediction_dist2[1].mean.detach().numpy())
print('vars: ', prediction_dist2[0].covariance_matrix.detach().numpy(), prediction_dist2[1].covariance_matrix.detach().numpy())
print('------')
print("MT-Model Mean and Variance on a Train Point")
print('mean: ', prediction_dist3.mean.detach().numpy())
print('vars:\n', prediction_dist3.covariance_matrix.detach().numpy())
print('------')
print("MT-Model Mean and Variance Nearby a Train Point")
print('mean: ', prediction_dist4.mean.detach().numpy())
print('vars:\n', prediction_dist4.covariance_matrix.detach().numpy())
print('------')
print("Actual Data Point (True Label)")
print(train_y[1:2, :])

**Stack trace/error message**

Training Ind. Model List
- - - - - - - - - - 
/home/ishank/Desktop/Gaussian-Process-Dynamics/venv/lib/python3.8/site-packages/gpytorch/utils/linear_cg.py:266: UserWarning: An output with one or more elements was resized since it had shape [11], which does not match the required output shape [1, 11].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at  ../aten/src/ATen/native/Resize.cpp:23.)
  _jit_linear_cg_updates_no_precond(
Iter 5/50 - Loss: 0.652
Iter 10/50 - Loss: 0.430
Iter 15/50 - Loss: 0.200
Iter 20/50 - Loss: -0.039
Iter 25/50 - Loss: -0.280
Iter 30/50 - Loss: -0.533
Iter 35/50 - Loss: -0.793
Iter 40/50 - Loss: -1.047
Iter 45/50 - Loss: -1.304
Iter 50/50 - Loss: -1.555
- - - - - - - - - - 
Training MT Model
- - - - - - - - - - 
Iter 5/50 - Loss: 0.991
Iter 10/50 - Loss: 0.768
Iter 15/50 - Loss: 0.540
Iter 20/50 - Loss: 0.301
Iter 25/50 - Loss: 0.055
Iter 30/50 - Loss: -0.197
Iter 35/50 - Loss: -0.454
Iter 40/50 - Loss: -0.711
Iter 45/50 - Loss: -0.968
Iter 50/50 - Loss: -1.224
- - - - - - - - - - 
- - - - - - - - - 
Model 1a
- - - - - - - - - 
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
Learned constant mean for the prior
Parameter containing:
tensor([-0.0137], requires_grad=True)
Learned kernel scale (variance of kernel sigma)
tensor(0.3450, grad_fn=<SoftplusBackward0>)
Learned kernel length scales (one for each input) \ell
tensor([[3.4328, 3.5159, 1.5461, 3.3662]], grad_fn=<SoftplusBackward0>)
- - - - - - - - - 
Model 1b
- - - - - - - - - 
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
Learned constant mean for the prior
Parameter containing:
tensor([-0.0089], requires_grad=True)
Learned kernel scale (variance of kernel sigma)
tensor(0.3588, grad_fn=<SoftplusBackward0>)
Learned kernel length scales (one for each input) \ell
tensor([[3.3585, 3.3732, 3.5119, 1.6963]], grad_fn=<SoftplusBackward0>)
- - - - - - - - - 
Model 2 (MultiTask=MT)
- - - - - - - - - 
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
Learned constant mean for the prior comp. 1
Parameter containing:
tensor([0.0083], requires_grad=True)
Learned constant mean for the prior comp. 2
Parameter containing:
tensor([-0.0142], requires_grad=True)
Learned static Index Kernel/Fixed Covariance K_{TT} matrix
Parameter containing:
tensor([[-0.1277,  0.5546],
        [ 0.6061,  0.2162]], requires_grad=True)
Learned kernel length scales (one for each input) \ell
tensor([[3.3520, 3.3918, 2.7205, 2.7693]], grad_fn=<SoftplusBackward0>)
Indp Model List Mean and Variance on a Train Point
mean:  [-0.66263324] [0.43995908]
vars:  [[0.00555626]] [[0.00556101]]
------
Indp Model List Mean and Variance Nearby a Train Point
mean:  [-0.61773294] [0.5056697]
vars:  [[0.00555528]] [[0.00555663]]
------
MT-Model Mean and Variance on a Train Point
mean:  [[-0.69353205  0.44850466]]
vars:
 [[0.56116694 0.04254041]
 [0.04254041 0.6265246 ]]
------
MT-Model Mean and Variance Nearby a Train Point
mean:  [[-0.6492302   0.51464087]]
vars:
 [[0.56116694 0.04254041]
 [0.04254041 0.6265246 ]]
------
Actual Data Point (True Label)
tensor([[-0.6583,  0.4381]])

Expected Behavior

The posterior covariance matrix obtained from the MultiTask kernel model is strangely frozen at [[0.56116694, 0.04254041], [0.04254041, 0.6265246]] for both the train data point and a shifted version of it.

I see two problems with the covariance matrix obtained from the MultiTask version.

  1. The diagonal entries, which are the variances, are quite large: much larger than the corresponding variances obtained from the Independent Model List. This is also strange to me because the K_{TT} index-kernel matrix associated with the MultiTask kernel has entries smaller than 1.0 (as it should, since my understanding is that it is a static matrix of correlation coefficients), and the length-scale and noise-variance values (in the range 1.5-3.5 and about 0.0056, respectively) are very similar for the MT model and the Ind. Model List. Since the final variance is the composition of K_{TT} and an RBF kernel, I would have expected the variances from the two models to have the same order of magnitude (see the sketch after this list).
  2. The entries remain completely unchanged between these two points (and at other test points, for that matter).
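
To make the composition concrete, here is a small inspection sketch (an illustration added here, not part of the original report). It assumes gpytorch's IndexKernel exposes a covar_factor parameter and a var property and builds the inter-task matrix as K_TT = covar_factor @ covar_factor.T + diag(var); if so, the per-task prior variance implied at any single input is diag(K_TT), since the plain RBF kernel evaluates to 1.0 at zero distance, and it can be compared directly against the independent models' outputscales.

import torch

# Sketch only: reconstruct the learned inter-task covariance K_TT and compare
# the implied per-task prior variance with the independent models' outputscales.
with torch.no_grad():
    task_kernel = MTmodel.covar_module.task_covar_module
    B = task_kernel.covar_factor                  # (num_tasks, rank) factor
    v = task_kernel.var.flatten()                 # per-task diagonal term
    K_TT = B @ B.t() + torch.diag(v)              # assumed form: B B^T + diag(v)
    print("Implied K_TT:\n", K_TT)
    print("Implied per-task prior variances:", torch.diag(K_TT))
    print("Ind. model outputscales:",
          model.models[0].covar_module.outputscale.item(),
          model.models[1].covar_module.outputscale.item())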

System information

Please complete the following information:

  • gpytorch 1.5.1
  • pytorch 1.10.0+cu102
  • Ubuntu 20.04

Additional context

Colab Notebook Version- https://colab.research.google.com/drive/1OalLncVeGtNHh-DqjnkScfy46uTtNud_?usp=sharing

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 7

Top GitHub Comments

2 reactions
wjmaddox commented, Nov 22, 2021

This is actually not quite a bug but seems to be caused by your usage of fast_pred_var in the inference step, which isn't strictly necessary here.

If I remove that flag (as well as removing the effect of the likelihood), then I get sensible outputs:

ux1 = train_x[1:2, :]
ux2 = train_x[1:2, :] + shift
# Performing inference on the training data points themselves
with torch.no_grad():
    # Get distributions of type multivariate-normal
    # prediction_dist1 = likelihood(*model(ux1, ux1))
    # prediction_dist2 = likelihood(*model(ux2, ux2))
    prediction_dist1 = model(ux1, ux1)
    prediction_dist2 = model(ux2, ux2)
    # Get distribution of type multitask multivariate normal
#     prediction_dist3 = MTlikelihood(MTmodel(ux1))
#     prediction_dist4 = MTlikelihood(MTmodel(ux2))
    prediction_dist3 = MTmodel(ux1)
    prediction_dist4 = MTmodel(ux2)

MT-Model Mean and Variance on a Train Point
mean:  [[-0.6976632  0.4489181]]
vars:
 [[1.15394592e-04 1.92224979e-06]
 [1.93342566e-06 1.18136406e-04]]
------
MT-Model Mean and Variance Nearby a Train Point
mean:  [[-0.65350515  0.5149511 ]]
vars:
 [[1.10507011e-04 1.90734863e-06]
 [1.90734863e-06 1.13129616e-04]]
------
Actual Data Point (True Label)
tensor([[-0.6583,  0.4381]])

If you're wondering why exactly this fixes it, here's what I was able to trace down:

Numerically, what was happening is that the second term in the posterior covariance was dropping to zero due to something happening in the Lanczos decomposition, i.e. $(K_{x_*, X} \otimes K_{TT})(K_{X,X} \otimes K_{TT} + \Sigma)^{-1}(K_{X, x_*} \otimes K_{TT}) \approx 0$.

Then the predictive variance was determined entirely by the first (prior) term and the inter-task covariance matrix, which is exactly what the reported predictive variances match.

I'll double-check fast_pred_var to see why these predictive variances were so far off and whether there's a bug there, but in the meantime it's probably best not to use fast_pred_var if you don't have that many test points.
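
As a minimal sketch of that comparison (an illustration added here, not part of the original comment), the same test point can be evaluated with and without fast_pred_var. It assumes that switching the model to train() and back to eval() clears the cached prediction strategy, as in recent gpytorch versions, so neither evaluation is served from a stale cache:

# Sketch only: compare the MT posterior covariance at one test point with and
# without the fast_pred_var (LOVE/Lanczos) approximation.
MTmodel.train()
MTmodel.eval()   # clear any cached prediction strategy
with torch.no_grad():
    exact_cov = MTmodel(ux1).covariance_matrix       # exact posterior covariance

MTmodel.train()
MTmodel.eval()   # clear the cache again before re-evaluating
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    fast_cov = MTmodel(ux1).covariance_matrix        # approximate posterior covariance

print("Exact posterior covariance:\n", exact_cov)
print("fast_pred_var posterior covariance:\n", fast_cov)
print("Max abs difference:", (exact_cov - fast_cov).abs().max().item())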

1 reaction
wjmaddox commented, Nov 26, 2021

Yes, that's the correct understanding.
