Training / Evaluation Question
Hello, I’ve been playing around with gpytorch recently and it is great! I just have a few questions about training and evaluation, in the context of exact GPs for regression.
I have something like this for training
```python
for i in range(training_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        torch.mean(model.likelihood.noise)
    ))
    optimizer.step()
```
and something like this for evaluation
```python
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    marginal_likelihood = mll(model(train_x), train_y)
    print(marginal_likelihood.item())
```
These, unfortunately, give different values. Is there a reason for this? Shouldn’t they both capture the marginal log likelihood of the training data?
The related question is how folks are training their GPs. The number of iterations seems arbitrary in the docs, so I was wondering if there is some termination condition people generally use. I sometimes see the loss increasing and oscillating (which is totally fine - if I wanted to treat it as a NN, I would lower the learning rate, etc.), but I figured that since GPs have closed forms, there should be a better way to do this than hyperparameter-tuning the learning rate and early stopping.
Issue Analytics
- Created: 3 years ago
- Reactions: 1
- Comments: 7 (2 by maintainers)
Top GitHub Comments
I would definitely use the `mll` as the output in train mode, as that is the objective function you are training through. I don’t think it would make much sense to use your training set for early stopping in eval mode.
I’m going to close this for now, feel free to reopen or open a new issue for further questions.
Gradient descent is only changing the hyperparameters of the kernel. It won’t cause the prior to be the same as the posterior. Those will always be different, since one is the distribution before seeing data and the other is a distribution conditioned on data.
The marginal log likelihood of a GP is computed from the prior mean and prior covariance matrix among the training points. Hence `mll` expects the prior multivariate normal distribution as its input. Note that you could pass any multivariate normal distribution to `mll` and it’ll run without errors. This is why you are able to pass in `model(x)`, which is always a multivariate normal distribution, even when `model` is in `.eval()` mode.