[Bug?] Much larger variance with MultiTask kernel compared to Independent Model List
See original GitHub issueš Bug
On training a MultiTask kernel based model and a collection of independent models tied together in an independent model list object on the same dataset, I see variance magnitudes that are orders of magnitude different. It is unclear why this is the case since the model parameters common to the 2 learnt models (The MultiTask model MTmodel
and the Independent Model List model
) seem to quite similar.
To reproduce
** Code snippet to reproduce **
import torch
import gpytorch
import numpy as np
class ExactGPModel(gpytorch.models.ExactGP):
def __init__(self, train_x, train_y, likelihood):
super().__init__(train_x, train_y, likelihood)
self.mean_module = gpytorch.means.ConstantMean()
self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=4))
def forward(self, x):
mean_x = self.mean_module(x)
covar_x = self.covar_module(x)
return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
class MultitaskGPModel(gpytorch.models.ExactGP):
def __init__(self, train_x, train_y, likelihood):
super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
# https://docs.gpytorch.ai/en/stable/means.html#multitaskmean
self.mean_module = gpytorch.means.MultitaskMean(
gpytorch.means.ConstantMean(), num_tasks=2
)
# Composition of index kernel and RBF kernel
self.covar_module = gpytorch.kernels.MultitaskKernel(
gpytorch.kernels.RBFKernel(ard_num_dims=4), num_tasks=2, rank=2
)
def forward(self, x):
mean_x = self.mean_module(x)
covar_x = self.covar_module(x)
# https://docs.gpytorch.ai/en/stable/distributions.html#multitaskmultivariatenormal
return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x, interleaved=False)
# Number of train samples
nsamps = 1000
# Fix seed for reproducability
np.random.seed(10)
# Joint input space tensor (u_t, x_t) to hold all inputs and trajectories
train_x = torch.tensor(np.random.uniform(low=-1.0, high=1.0, size=(nsamps, 4))).float()
# Generate output samples
# A and B matrices
A = torch.tensor([[1., 0.],
[0., 1.]])
B = torch.tensor([[-0.2, 0.1],
[0.15, 0.15]])
# Output states starting time index 1, no observation noise
train_y = torch.zeros(nsamps, 2)
# Obtain the output states $(x_{t+1, 1}, x_{t+1, 2})$
for i in range(nsamps):
# Get the output next state
x_next = torch.matmul(A, train_x[i, 2:4]) + torch.matmul(B, train_x[i, 0:2])
# No observation noise added
train_y[i, :] = x_next
# dataset = torch.cat([train_x, train_y], dim=1)
likelihood1 = gpytorch.likelihoods.GaussianLikelihood()
model1 = ExactGPModel(train_x, train_y[:, 0], likelihood1)
likelihood2 = gpytorch.likelihoods.GaussianLikelihood()
model2 = ExactGPModel(train_x, train_y[:, 1], likelihood2)
# Collect the sub-models in an IndependentMultiOutputGP, and the respective likelihoods in a MultiOutputLikelihood
model = gpytorch.models.IndependentModelList(model1, model2)
likelihood = gpytorch.likelihoods.LikelihoodList(model1.likelihood, model2.likelihood)
mll = gpytorch.mlls.SumMarginalLogLikelihood(likelihood, model)
# Perform Ind. Model List Training
training_iterations = 50
# Find optimal model hyper-parameters
model.train()
likelihood.train()
# Use the Adam optimizer
optimizer = torch.optim.Adam([
{'params': model.parameters()}, # Includes all submodel and all likelihood parameters
], lr=0.1)
print("Training Ind. Model List\n- - - - - - - - - - ")
for i in range(training_iterations):
optimizer.zero_grad()
output = model(*model.train_inputs)
loss = -mll(output, model.train_targets)
loss.backward()
if (i + 1) % 5 == 0:
print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
optimizer.step()
print("- - - - - - - - - - ")
# MTGaussianLikelihood allows for modelling a full 2x2 Noise Cov. Prior
MTlikelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
MTmodel = MultitaskGPModel(train_x, train_y, MTlikelihood)
training_iterations = 50
# Find optimal MTmodel hyperparameters
MTmodel.train()
MTlikelihood.train()
# Use the adam optimizer
optimizer = torch.optim.Adam([
{'params': MTmodel.parameters()}, # Includes GaussianLikelihood parameters
], lr=0.1)
# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(MTlikelihood, MTmodel)
print("Training MT Model\n- - - - - - - - - - ")
for i in range(training_iterations):
optimizer.zero_grad()
output = MTmodel(train_x)
loss = -mll(output, train_y)
loss.backward()
if (i + 1) % 5 == 0:
print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
optimizer.step()
print("- - - - - - - - - - ")
# View the parameters (and others specific to MT)-
# (1) Learned value of observation-noise covariance
# (2) Learned constant mean prior
# (3) Learned kernel scale parameter (\sigma)
# (4) Learned kernel length scale (\ell)
# Output 1 y_{a}
print("- - - - - - - - - \nModel 1a\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[0].likelihood.noise_covar.noise)
print("Learned constant mean for the prior")
print(model.models[0].mean_module.constant)
print("Learned kernel scale (variance of kernel sigma)")
print(model.models[0].covar_module.outputscale)
print("Learned kernel length scales (one for each input) \ell")
print(model.models[0].covar_module.base_kernel.lengthscale)
# Output 2 y_{b}
print("- - - - - - - - - \nModel 1b\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(model.models[1].likelihood.noise_covar.noise)
print("Learned constant mean for the prior")
print(model.models[1].mean_module.constant)
print("Learned kernel scale (variance of kernel sigma)")
print(model.models[1].covar_module.outputscale)
print("Learned kernel length scales (one for each input) \ell")
print(model.models[1].covar_module.base_kernel.lengthscale)
# MT Model
print("- - - - - - - - - \nModel 2 (MultiTask=MT)\n- - - - - - - - - ")
print("Learned Noise Covariance")
print(MTmodel.likelihood.noise)
print("Learned constant mean for the prior comp. 1")
print(MTmodel.mean_module.base_means[0].constant)
print("Learned constant mean for the prior comp. 2")
print(MTmodel.mean_module.base_means[1].constant)
print("Learned static Index Kernel/Fixed Covariance K_{TT} matrix")
print(MTmodel.covar_module.task_covar_module.covar_factor)
print("Learned kernel length scales (one for each input) \ell")
print(MTmodel.covar_module.data_covar_module.lengthscale)
# Set models into eval mode
model.eval()
MTmodel.eval()
likelihood.eval()
MTlikelihood.eval()
# Shift this distance away from a train data point
shift = 0.05
# Train data-point
ux1 = train_x[1:2, :]
ux2 = train_x[1:2, :] + shift
# Performing inference on the training data points themselves
with torch.no_grad(), gpytorch.settings.fast_pred_var():
# Get distributions of type multivariate-normal
prediction_dist1 = likelihood(*model(ux1, ux1))
prediction_dist2 = likelihood(*model(ux2, ux2))
# Get distribution of type multi-task multi-variate normal\
prediction_dist3 = MTlikelihood(MTmodel(ux1))
prediction_dist4 = MTlikelihood(MTmodel(ux2))
print("Indp Model List Mean and Variance on a Train Point")
print('mean: ', prediction_dist1[0].mean.detach().numpy(), prediction_dist1[1].mean.detach().numpy())
print('vars: ', prediction_dist1[0].covariance_matrix.detach().numpy(), prediction_dist1[1].covariance_matrix.detach().numpy())
print('------')
print("Indp Model List Mean and Variance Nearby a Train Point")
print('mean: ', prediction_dist2[0].mean.detach().numpy(), prediction_dist2[1].mean.detach().numpy())
print('vars: ', prediction_dist2[0].covariance_matrix.detach().numpy(), prediction_dist2[1].covariance_matrix.detach().numpy())
print('------')
print("MT-Model Mean and Variance on a Train Point")
print('mean: ', prediction_dist3.mean.detach().numpy())
print('vars:\n', prediction_dist3.covariance_matrix.detach().numpy())
print('------')
print("MT-Model Mean and Variance Nearby a Train Point")
print('mean: ', prediction_dist4.mean.detach().numpy())
print('vars:\n', prediction_dist4.covariance_matrix.detach().numpy())
print('------')
print("Actual Data Point (True Label)")
print(train_y[1:2, :])
** Stack trace/error message **
Training Ind. Model List
- - - - - - - - - -
/home/ishank/Desktop/Gaussian-Process-Dynamics/venv/lib/python3.8/site-packages/gpytorch/utils/linear_cg.py:266: UserWarning: An output with one or more elements was resized since it had shape [11], which does not match the required output shape [1, 11].This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:23.)
_jit_linear_cg_updates_no_precond(
Iter 5/50 - Loss: 0.652
Iter 10/50 - Loss: 0.430
Iter 15/50 - Loss: 0.200
Iter 20/50 - Loss: -0.039
Iter 25/50 - Loss: -0.280
Iter 30/50 - Loss: -0.533
Iter 35/50 - Loss: -0.793
Iter 40/50 - Loss: -1.047
Iter 45/50 - Loss: -1.304
Iter 50/50 - Loss: -1.555
- - - - - - - - - -
Training MT Model
- - - - - - - - - -
Iter 5/50 - Loss: 0.991
Iter 10/50 - Loss: 0.768
Iter 15/50 - Loss: 0.540
Iter 20/50 - Loss: 0.301
Iter 25/50 - Loss: 0.055
Iter 30/50 - Loss: -0.197
Iter 35/50 - Loss: -0.454
Iter 40/50 - Loss: -0.711
Iter 45/50 - Loss: -0.968
Iter 50/50 - Loss: -1.224
- - - - - - - - - -
- - - - - - - - -
Model 1a
- - - - - - - - -
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
Learned constant mean for the prior
Parameter containing:
tensor([-0.0137], requires_grad=True)
Learned kernel scale (variance of kernel sigma)
tensor(0.3450, grad_fn=<SoftplusBackward0>)
Learned kernel length scales (one for each input) \ell
tensor([[3.4328, 3.5159, 1.5461, 3.3662]], grad_fn=<SoftplusBackward0>)
- - - - - - - - -
Model 1b
- - - - - - - - -
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
Learned constant mean for the prior
Parameter containing:
tensor([-0.0089], requires_grad=True)
Learned kernel scale (variance of kernel sigma)
tensor(0.3588, grad_fn=<SoftplusBackward0>)
Learned kernel length scales (one for each input) \ell
tensor([[3.3585, 3.3732, 3.5119, 1.6963]], grad_fn=<SoftplusBackward0>)
- - - - - - - - -
Model 2 (MultiTask=MT)
- - - - - - - - -
Learned Noise Covariance
tensor([0.0055], grad_fn=<AddBackward0>)
Learned constant mean for the prior comp. 1
Parameter containing:
tensor([0.0083], requires_grad=True)
Learned constant mean for the prior comp. 2
Parameter containing:
tensor([-0.0142], requires_grad=True)
Learned static Index Kernel/Fixed Covariance K_{TT} matrix
Parameter containing:
tensor([[-0.1277, 0.5546],
[ 0.6061, 0.2162]], requires_grad=True)
Learned kernel length scales (one for each input) \ell
tensor([[3.3520, 3.3918, 2.7205, 2.7693]], grad_fn=<SoftplusBackward0>)
Indp Model List Mean and Variance on a Train Point
mean: [-0.66263324] [0.43995908]
vars: [[0.00555626]] [[0.00556101]]
------
Indp Model List Mean and Variance Nearby a Train Point
mean: [-0.61773294] [0.5056697]
vars: [[0.00555528]] [[0.00555663]]
------
MT-Model Mean and Variance on a Train Point
mean: [[-0.69353205 0.44850466]]
vars:
[[0.56116694 0.04254041]
[0.04254041 0.6265246 ]]
------
MT-Model Mean and Variance Nearby a Train Point
mean: [[-0.6492302 0.51464087]]
vars:
[[0.56116694 0.04254041]
[0.04254041 0.6265246 ]]
------
Actual Data Point (True Label)
tensor([[-0.6583, 0.4381]])
Expected Behavior
The covariance matrix of the posterior obtained from the MultiTask kernel model is strangely frozen on- [0.56116694, 0.04254041 0.04254041, 0.6265246 ], For both the train data point and a shifted version of it.
I find 2 problems with the covariance Matrix obtained from the MultiTask version.
- The diagonal entries which are the variances are quite large. Much larger than the corresponding variances obtained from the Independent Model List. Also it is strange to me that this be the case since the K_{TT} index kernel matrix associated with the MultiTask kernel has entries smaller than 1.0 (as it should since it is a static matrix of correlation coefficients is my understanding) and the length scale parameters, and the noise variance values (in the range of 1.5-3.5 and 0.0056 respectively) are very similar for the MT-model and the Ind. Model-List. Since the final variance is the composition of K_{TT} and an RBF kernel, I would have expected the variances from the 2 models to have the same order of magnitudeā¦
- They remain completely unchanged between these 2 points (and on other test points for that matter)
System information
Please complete the following information:
- gpytorch 1.5.1
- pytorch 1.10.0+cu102
- Ubuntu 20.04
Additional context
Colab Notebook Version- https://colab.research.google.com/drive/1OalLncVeGtNHh-DqjnkScfy46uTtNud_?usp=sharing
Issue Analytics
- State:
- Created 2 years ago
- Comments:7
Top GitHub Comments
This is actually not quite a bug but seems to be caused by your usage of
fast_pred_var
in the inference step which isnāt strictly necessary here.If I remove that flag (as well as removing the effect of the likelihood), then I get sensible outputs:
If youāre wondering why exactly this fixes it, hereās what I was able to trace down:
Numerically, what was happening is that the second term in the posterior covariance was dropping to zero due to something happening in the Lanczos decomposition, aka
(K_{x_test, X} \kron K_{TT})(K_{XX} \kron K_{TT} + \Sigma)^{-1}(K_{x_test, X} \kron K_{TT}) \approx 0
.Then, the predictive variance was determined entirely by the first term and the intertask covariance matrix, which was actually matching the predictive variances.
Iāll doublecheck the fast_pred_var to see why these predictive variances were so far off and if thereās a bug there, but in the meantime, itās probably best to not use
fast_pred_var
if you donāt have that many test points.Yes, thatās the correct understanding.