[Bug] Invalid index in gather during call to mll in training loop
🐛 Bug
I followed the GPyTorch DKL (deep kernel learning) docs exactly and got the following error during the training loop, when computing the loss with the marginal log likelihood.
** Stack trace/error message **
Traceback (most recent call last):
File "gp.py", line 24, in <module>
exact.train()
File "/st2/jeff/real_estate/models/gaussian_processes/exact.py", line 102, in train
loss = -mll(output, train_y).sum()
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/module.py", line 22, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 27, in forward
res = output.log_prob(target)
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/distributions/multivariate_normal.py", line 128, in log_prob
inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/batch_repeat_lazy_tensor.py", line 242, in inv_quad_logdet
inv_quad_rhs, logdet, reduce_inv_quad=False
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 1052, in inv_quad_logdet
*args,
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/functions/_inv_quad_log_det.py", line 63, in forward
preconditioner, precond_lt, logdet_correction = lazy_tsr._preconditioner()
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/added_diag_lazy_tensor.py", line 59, in _preconditioner
self._piv_chol_self = pivoted_cholesky.pivoted_cholesky(self._lazy_tensor, max_iter)
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/utils/pivoted_cholesky.py", line 19, in pivoted_cholesky
matrix_diag = matrix._approx_diag()
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/interpolated_lazy_tensor.py", line 90, in _approx_diag
left_res = left_interp(self.left_interp_indices, self.left_interp_values, base_diag_root.unsqueeze(-1))
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/utils/interpolation.py", line 187, in left_interp
res = rhs_expanded.gather(-3, interp_indices_expanded).mul(interp_values_expanded)
RuntimeError: Invalid index in gather at /tmp/pip-req-build-58y_cjjl/aten/src/TH/generic/THTensorEvenMoreMath.cpp:472
loss: 57158.71 med: 0.30, minmax: 0.30 0.30 noise: 0.56: : 0it [00:08, ?it/s]
Expected Behavior
I am unsure what is causing the error or how to fix it, because training initially iterates and computes the loss successfully, and only then crashes. The tensor shapes must be correct, so there must be some numerical instability, but I am unsure where to look for it.
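One way to narrow down where the instability starts is to fail fast on non-finite values before they reach the mll call. A minimal sketch (the `output`/`mll`/`train_y` names come from the stack trace above; the check itself is just an assumption about where to probe):

```python
import torch

def assert_finite(t: torch.Tensor, name: str, step: int) -> None:
    """Fail fast on NaN/inf values instead of the opaque gather error later."""
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} became non-finite at step {step}")

# usage inside the training loop, around the line from the stack trace:
#     output = model(train_x)
#     assert_finite(output.mean, "GP output mean", i)
#     loss = -mll(output, train_y).sum()
```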
System information
gpytorch version: 0.3.6
torch version: 1.2.0
OS: Ubuntu 18.04
Issue Analytics
- Created: 4 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
@deltaskelta - it looks like these NaNs came from a divide-by-zero error that #971 addresses. However, the error mostly stems from the neural network outputs collapsing to a single point, which will cause lots of other learning issues for the GP.
To solve that larger issue, I would either …
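The collapse is already visible in the progress line above (`med: 0.30, minmax: 0.30 0.30`, i.e. no spread at all). A minimal sketch of a check that surfaces it directly (the `feature_extractor` attribute name is an assumption about the DKL model, not confirmed code from this issue):

```python
import torch

def feature_spread(features: torch.Tensor) -> float:
    """Min-to-max spread of the extracted features; a value near zero
    means the network has collapsed its outputs to a single point."""
    return (features.max() - features.min()).item()

# usage inside the training loop (hypothetical attribute name):
#     projected = model.feature_extractor(train_x)
#     if feature_spread(projected) < 1e-6:
#         print("warning: features collapsed to a single point")
```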
Hi,
I also received the same error in a GP for multiclass classification like the one in #1003. I am using gpytorch 1.0.0 and torch 1.3.1. Fortunately it went away after normalizing the inputs properly.
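For anyone landing here, a sketch of the kind of input normalization meant above (standardization with training-set statistics; the tensors are placeholders, not this issue's actual data):

```python
import torch

# placeholder tensors standing in for the real training/test inputs
train_x, test_x = torch.randn(100, 4), torch.randn(20, 4)

# standardize each input dimension using training-set statistics only
mean = train_x.mean(dim=0)
std = train_x.std(dim=0).clamp_min(1e-8)  # guard against zero-variance columns
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std  # reuse training stats to avoid leakage
```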