
[Bug] Different nan handling under GPU and CPU

See original GitHub issue

🐛 Bug

There are cases in which code run on the CPU throws a NanError while the same code run on the GPU throws no error but silently produces NaNs, e.g. in the training loss.

To reproduce

import math 

import gpytorch 
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

# Training data is 100 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 100).to(device)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()).to(device) * math.sqrt(0.04)

# Intentionally corrupt the train_y to give nans
train_y = train_y.log()
print(train_y)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
    
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood).to(device)

training_iter = 10

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes GaussianLikelihood parameters

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()
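
As an aside, the corruption step yields NaNs because log() is undefined for the non-positive half of the sine targets; a quick sanity check along these lines (a minimal sketch, not part of the original report) confirms the targets already contain NaNs before training starts:

# Sanity check (illustrative): the corrupted targets contain NaNs because
# log() of any negative value is nan.
num_nan = torch.isnan(train_y).sum().item()
print(f"{num_nan} of {train_y.numel()} training targets are NaN")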

On CPU

Iter 1/10 - Loss: nan   lengthscale: 0.693   noise: 0.693
---------------------------------------------------------------------------
NanError                                  Traceback (most recent call last)
<ipython-input-17-ae7f611b728d> in <module>()
     17     output = model(train_x)
     18     # Calc loss and backprop gradients
---> 19     loss = -mll(output, train_y)
     20     loss.backward()
     21     print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (

8 frames
/usr/local/lib/python3.7/dist-packages/gpytorch/utils/cholesky.py in _psd_safe_cholesky(A, out, jitter, max_tries)
     29         if isnan.any():
     30             raise NanError(
---> 31                 f"cholesky_cpu: {isnan.sum().item()} of {A.numel()} elements of the {A.shape} tensor are NaN."
     32             )
     33 

NanError: cholesky_cpu: 10000 of 10000 elements of the torch.Size([100, 100]) tensor are NaN.

but on GPU:

Iter 1/10 - Loss: nan   lengthscale: 0.693   noise: 0.693
Iter 2/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 3/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 4/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 5/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 6/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 7/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 8/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 9/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 10/10 - Loss: nan   lengthscale: nan   noise: nan

Expected Behavior

This is my query: is the behaviour above now expected, or is it a bug? I would have thought it’s best to have consistency between the CPU and GPU. The difference appears to arise from torch itself: with version 1.8.1 one gets the NanError in both cases, but with 1.9.0 one gets the behaviour above. Related to https://github.com/pytorch/pytorch/issues/1810?
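
Until the behaviour is consistent upstream, one way to fail fast on both devices is to check the loss explicitly inside the training loop. This is a minimal sketch of the loop above with a hypothetical NaN guard added, not an official GPyTorch recommendation:

for i in range(training_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    # Raise on both CPU and GPU instead of silently stepping on a NaN loss
    if torch.isnan(loss).any():
        raise RuntimeError(f"NaN loss at iteration {i + 1}: check the training targets and kernel matrix")
    loss.backward()
    optimizer.step()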

System information

GPyTorch version: 1.5.1
PyTorch version: 1.9.0 (+cu102)
OS: Google Colab notebook

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 8

Top GitHub Comments

2 reactions
npbaskerville commented, Sep 10, 2021

Using 1.9.0 from PyPI.

2 reactions
Balandat commented, Sep 9, 2021

Hmm, so @wjmaddox the code path above is actually going through torch.linalg.cholesky_ex via psd_safe_cholesky. It looks like on recent torch versions cholesky_ex doesn’t properly surface the info code?

>> torch.linalg.cholesky_ex(resp)

torch.return_types.linalg_cholesky_ex(
L=tensor([[nan, 0., 0.,  ..., 0., 0., 0.],
        [nan, nan, 0.,  ..., 0., 0., 0.],
        [nan, nan, nan,  ..., 0., 0., 0.],
        ...,
        [nan, nan, nan,  ..., nan, 0., 0.],
        [nan, nan, nan,  ..., nan, nan, 0.],
        [nan, nan, nan,  ..., nan, nan, nan]]),
info=tensor(1, dtype=torch.int32))

>> torch.linalg.cholesky_ex(resp.to(torch.device("cuda")))

torch.return_types.linalg_cholesky_ex(
L=tensor([[nan, 0., 0.,  ..., 0., 0., 0.],
        [nan, nan, 0.,  ..., 0., 0., 0.],
        [nan, nan, nan,  ..., 0., 0., 0.],
        ...,
        [nan, nan, nan,  ..., nan, 0., 0.],
        [nan, nan, nan,  ..., nan, nan, 0.],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0'),
info=tensor(0, device='cuda:0', dtype=torch.int32))

@mruberry any thoughts why this may happen?
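
For anyone hitting this, a possible interim workaround is to validate the result of cholesky_ex explicitly. This is a minimal sketch with a hypothetical checked_cholesky helper; given the info-code issue above, the NaN fallback check is what does the real work on affected CUDA builds:

def checked_cholesky(A):
    # Factorize and verify, so CPU and GPU behave the same.
    L, info = torch.linalg.cholesky_ex(A)
    # A nonzero info marks a failed factorization; on affected CUDA builds
    # info may stay 0, so also inspect L for NaNs as a fallback.
    if (info != 0).any() or torch.isnan(L).any():
        raise RuntimeError("Cholesky failed: matrix is not positive definite or contains NaNs")
    return L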

