
[Bug] Invalid index in gather during call to mll in training loop

See original GitHub issue

🐛 Bug

I followed the docs related to DKL (deep kernel learning) exactly and got the following error during the training loop, when calculating the loss with the marginal log likelihood function.
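For reference, the kind of DKL training loop described here typically looks like the sketch below. This is a minimal reconstruction, not the reporter's actual code; the FeatureExtractor module, layer sizes, and grid size are all assumptions.

```python
import torch
import gpytorch

train_x = torch.randn(500, 10)   # placeholder data
train_y = torch.randn(500)

# Hypothetical feature extractor standing in for the reporter's network.
class FeatureExtractor(torch.nn.Sequential):
    def __init__(self, in_dim, out_dim=2):
        super().__init__(
            torch.nn.Linear(in_dim, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, out_dim),
        )

class DKLModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.feature_extractor = FeatureExtractor(train_x.size(-1))
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.GridInterpolationKernel(
            gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=2)),
            num_dims=2, grid_size=100,
        )

    def forward(self, x):
        z = self.feature_extractor(x)
        # Keep the learned features inside the interpolation grid; features that
        # drift outside the grid are one way left_interp ends up with bad indices.
        z = 2.0 * (z - z.min(0)[0]) / (z.max(0)[0] - z.min(0)[0] + 1e-8) - 1.0
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(z), self.covar_module(z)
        )

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = DKLModel(train_x, train_y, likelihood)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
likelihood.train()
for _ in range(50):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)   # the call where the reported error occurs
    loss.backward()
    optimizer.step()
```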

Stack trace/error message

Traceback (most recent call last):
  File "gp.py", line 24, in <module>
    exact.train()
  File "/st2/jeff/real_estate/models/gaussian_processes/exact.py", line 102, in train
    loss = -mll(output, train_y).sum()
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/module.py", line 22, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 27, in forward
    res = output.log_prob(target)
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/distributions/multivariate_normal.py", line 128, in log_prob
    inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/batch_repeat_lazy_tensor.py", line 242, in inv_quad_logdet
    inv_quad_rhs, logdet, reduce_inv_quad=False
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 1052, in inv_quad_logdet
    *args,
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/functions/_inv_quad_log_det.py", line 63, in forward
    preconditioner, precond_lt, logdet_correction = lazy_tsr._preconditioner()
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/added_diag_lazy_tensor.py", line 59, in _preconditioner
    self._piv_chol_self = pivoted_cholesky.pivoted_cholesky(self._lazy_tensor, max_iter)
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/utils/pivoted_cholesky.py", line 19, in pivoted_cholesky
    matrix_diag = matrix._approx_diag()
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/interpolated_lazy_tensor.py", line 90, in _approx_diag
    left_res = left_interp(self.left_interp_indices, self.left_interp_values, base_diag_root.unsqueeze(-1))
  File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/utils/interpolation.py", line 187, in left_interp
    res = rhs_expanded.gather(-3, interp_indices_expanded).mul(interp_values_expanded)
RuntimeError: Invalid index in gather at /tmp/pip-req-build-58y_cjjl/aten/src/TH/generic/THTensorEvenMoreMath.cpp:472
loss: 57158.71 med: 0.30, minmax: 0.30 0.30 noise: 0.56: : 0it [00:08, ?it/s]

Expected Behavior

I am left unsure of what is causing the error and how to go about fixing it, because training initially succeeds in iterating and calculating the loss and then crashes. The tensor sizes must be correct, but there must be some numerical instability, and I am unsure where to look for it.
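One way to test the numerical-instability suspicion is to inspect the features the extractor feeds to the GP right before the mll call. The helper below is hypothetical (check_features is not part of gpytorch), but it shows the two checks that matter here:

```python
import torch

def check_features(z, name="features"):
    # Flag NaN/Inf values, which can later surface as the gather error above.
    assert torch.isfinite(z).all(), f"{name} contain NaN or Inf"
    # Flag features that have collapsed to (nearly) a single point.
    spread = z.max(0)[0] - z.min(0)[0]
    if (spread < 1e-6).any():
        print(f"warning: {name} have (nearly) zero spread in some dimension")
```

Calling it on the extractor output every iteration narrows down whether the crash follows a NaN or a collapse of the learned features.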

System information

  • gpytorch version: 0.3.6
  • torch version: 1.2.0
  • OS: Ubuntu 18.04

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
gpleiss commented on Nov 25, 2019

@deltaskelta - it looks like these NaNs came from a divide-by-zero error that #971 addresses. However, this error is mostly stemming from the neural network outputs collapsing to a single point, which will cause lots of other learning issues for the GP.

To solve that larger issue, I would do one of the following (a minimal sketch of options 2 and 3 is given after this list):

  1. Pre-train the neural network without a GP for a few iterations. This is especially useful for large NNs, like the one in your example.
  2. Use batch normalization.
  3. Make sure that the outputs are scaled to be zero mean and unit variance.
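A minimal sketch of options 2 and 3, with batch normalization inside the extractor and zero-mean / unit-variance scaling of its outputs; the FeatureExtractor name and layer sizes are illustrative rather than taken from this issue:

```python
import torch

class FeatureExtractor(torch.nn.Module):
    def __init__(self, in_dim, out_dim=2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 128),
            torch.nn.BatchNorm1d(128),      # option 2: batch normalization
            torch.nn.ReLU(),
            torch.nn.Linear(128, out_dim),
        )

    def forward(self, x):
        z = self.net(x)
        # Option 3: standardize each output dimension to zero mean / unit variance.
        return (z - z.mean(0)) / (z.std(0) + 1e-6)
```

Option 1 simply means training this network on its own objective for a few epochs before attaching the GP head.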

0 reactions
mgarort commented on Dec 26, 2019

Hi,

I also received the same error in a GP for multiclass classification like the one in #1003. I am using gpytorch 1.0.0 and torch 1.3.1.

Fortunately it went away after normalizing the inputs properly.
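For anyone hitting the same thing, "normalizing the inputs" here usually means standardizing each input dimension (and often the targets as well) before training; a minimal sketch with placeholder tensors:

```python
import torch

train_x = torch.randn(500, 10)   # placeholder data
train_y = torch.randn(500)

# Standardize each input dimension to zero mean / unit variance.
x_mean, x_std = train_x.mean(0, keepdim=True), train_x.std(0, keepdim=True)
train_x = (train_x - x_mean) / (x_std + 1e-8)

# Standardizing the targets is usually a good idea as well.
train_y = (train_y - train_y.mean()) / train_y.std()
```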

Read more comments on GitHub >

Top Results From Across the Web

Pytorch RuntimeError: Invalid index in gather - Stack Overflow
This simply means your index tensor c has invalid indices. For example, the following index tensor is valid:...
Read more >
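To illustrate the point from that answer, a tiny example of valid versus out-of-range indices for torch.gather:

```python
import torch

src = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# Valid: every index is within [0, 1] for a dimension of size 2.
ok = torch.tensor([[0, 1], [1, 0]])
print(torch.gather(src, 1, ok))

# Invalid: 2 is out of range, so this raises a RuntimeError
# (reported as "Invalid index in gather" on older PyTorch builds).
bad = torch.tensor([[0, 2], [1, 0]])
# torch.gather(src, 1, bad)
```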
Bulding multiclass classification model · Issue #1003 - GitHub
mgarort mentioned this issue on Dec 25, 2019. [Bug] Invalid index in gather during call to mll in training loop #955.
Read more >
tf.gather | TensorFlow v2.11.0
Indices are always validated on CPU and never validated on GPU. Gather slices from params axis axis according to indices. indices must...
Read more >
NN: Invalid index in gather - PyTorch Forums
I am new to PyTorch and am getting the following error when building a neural network. Please let me know if you need...
Read more >
SageMaker Training Compiler Troubleshooting
If a training job fails with the Missing XLA configuration error message, it might be due to a misconfiguration in the number of...
Read more >
