[Bug] Invalid index in gather during call to mll in training loop
🐛 Bug
I followed the GPyTorch DKL (deep kernel learning) docs exactly and got the following error during the training loop, when computing the loss with the marginal log likelihood.
** Stack trace/error message **
Traceback (most recent call last):
File "gp.py", line 24, in <module>
exact.train()
File "/st2/jeff/real_estate/models/gaussian_processes/exact.py", line 102, in train
loss = -mll(output, train_y).sum()
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/module.py", line 22, in __call__
outputs = self.forward(*inputs, **kwargs)
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/mlls/exact_marginal_log_likelihood.py", line 27, in forward
res = output.log_prob(target)
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/distributions/multivariate_normal.py", line 128, in log_prob
inv_quad, logdet = covar.inv_quad_logdet(inv_quad_rhs=diff.unsqueeze(-1), logdet=True)
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/batch_repeat_lazy_tensor.py", line 242, in inv_quad_logdet
inv_quad_rhs, logdet, reduce_inv_quad=False
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/lazy_tensor.py", line 1052, in inv_quad_logdet
*args,
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/functions/_inv_quad_log_det.py", line 63, in forward
preconditioner, precond_lt, logdet_correction = lazy_tsr._preconditioner()
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/added_diag_lazy_tensor.py", line 59, in _preconditioner
self._piv_chol_self = pivoted_cholesky.pivoted_cholesky(self._lazy_tensor, max_iter)
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/utils/pivoted_cholesky.py", line 19, in pivoted_cholesky
matrix_diag = matrix._approx_diag()
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/lazy/interpolated_lazy_tensor.py", line 90, in _approx_diag
left_res = left_interp(self.left_interp_indices, self.left_interp_values, base_diag_root.unsqueeze(-1))
File "/st2/jeff/anaconda3/envs/jeff/lib/python3.7/site-packages/gpytorch/utils/interpolation.py", line 187, in left_interp
res = rhs_expanded.gather(-3, interp_indices_expanded).mul(interp_values_expanded)
RuntimeError: Invalid index in gather at /tmp/pip-req-build-58y_cjjl/aten/src/TH/generic/THTensorEvenMoreMath.cpp:472
loss: 57158.71 med: 0.30, minmax: 0.30 0.30 noise: 0.56: : 0it [00:08, ?it/s]
Expected Behavior
I am unsure what is causing the error or how to fix it, because training initially iterates and computes the loss successfully, and only then crashes. The tensor shapes must be correct, so there must be some numerical instability, but I am unsure where to look for it.
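One way to narrow down where the instability starts is to fail fast on non-finite values before they reach the mll call. A minimal sketch (the `output`/`mll`/`train_y` names come from the stack trace above; the check itself is just an assumption about where to probe):

```python
import torch

def assert_finite(t: torch.Tensor, name: str, step: int) -> None:
    """Fail fast on NaN/inf values instead of the opaque gather error later."""
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} became non-finite at step {step}")

# usage inside the training loop, around the line from the stack trace:
#     output = model(train_x)
#     assert_finite(output.mean, "GP output mean", i)
#     loss = -mll(output, train_y).sum()
```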
System information
gpytorch version: 0.3.6
torch version: 1.2.0
OS: Ubuntu 18.04
Issue Analytics
- Created: 4 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
@deltaskelta - it looks like these NaNs came from a divide-by-zero error that #971 addresses. However, the error mostly stems from the neural network outputs collapsing to a single point, which will cause lots of other learning issues for the GP.
To solve that larger issue, I would either …
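The collapse is already visible in the progress line above (`med: 0.30, minmax: 0.30 0.30`, i.e. no spread at all). A minimal sketch of a check that surfaces it directly (the `feature_extractor` attribute name is an assumption about the DKL model, not confirmed code from this issue):

```python
import torch

def feature_spread(features: torch.Tensor) -> float:
    """Min-to-max spread of the extracted features; a value near zero
    means the network has collapsed its outputs to a single point."""
    return (features.max() - features.min()).item()

# usage inside the training loop (hypothetical attribute name):
#     projected = model.feature_extractor(train_x)
#     if feature_spread(projected) < 1e-6:
#         print("warning: features collapsed to a single point")
```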
Hi,
I also received the same error in a GP for multiclass classification like the one in #1003. I am using gpytorch 1.0.0 and torch 1.3.1. Fortunately it went away after normalizing the inputs properly.
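For anyone landing here, a sketch of the kind of input normalization meant above (standardization with training-set statistics; the tensors are placeholders, not this issue's actual data):

```python
import torch

# placeholder tensors standing in for the real training/test inputs
train_x, test_x = torch.randn(100, 4), torch.randn(20, 4)

# standardize each input dimension using training-set statistics only
mean = train_x.mean(dim=0)
std = train_x.std(dim=0).clamp_min(1e-8)  # guard against zero-variance columns
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std  # reuse training stats to avoid leakage
```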