
[Bug] Different nan handling under GPU and CPU

See original GitHub issue

🐛 Bug

There are cases in which code run on the CPU throws a NanError while the same code run on the GPU throws no error but silently produces NaNs, e.g. in the training loss.

To reproduce

import math 

import gpytorch 
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

# Training data is 100 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 100).to(device)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()).to(device) * math.sqrt(0.04)

# Intentionally corrupt the train_y to give nans
train_y = train_y.log()
print(train_y)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
    
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood).to(device)

training_iter = 10

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes GaussianLikelihood parameters

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()
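
As an aside, the corruption step yields NaNs because log() is undefined for the non-positive half of the sine targets; a quick sanity check along these lines (a minimal sketch, not part of the original report) confirms the targets already contain NaNs before training starts:

# Sanity check (illustrative): the corrupted targets contain NaNs because
# log() of any negative value is nan.
num_nan = torch.isnan(train_y).sum().item()
print(f"{num_nan} of {train_y.numel()} training targets are NaN")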

On CPU

Iter 1/10 - Loss: nan   lengthscale: 0.693   noise: 0.693
---------------------------------------------------------------------------
NanError                                  Traceback (most recent call last)
<ipython-input-17-ae7f611b728d> in <module>()
     17     output = model(train_x)
     18     # Calc loss and backprop gradients
---> 19     loss = -mll(output, train_y)
     20     loss.backward()
     21     print('Iter %d/%d - Loss: %.3f   lengthscale: %.3f   noise: %.3f' % (

8 frames
/usr/local/lib/python3.7/dist-packages/gpytorch/utils/cholesky.py in _psd_safe_cholesky(A, out, jitter, max_tries)
     29         if isnan.any():
     30             raise NanError(
---> 31                 f"cholesky_cpu: {isnan.sum().item()} of {A.numel()} elements of the {A.shape} tensor are NaN."
     32             )
     33 

NanError: cholesky_cpu: 10000 of 10000 elements of the torch.Size([100, 100]) tensor are NaN.

but on GPU:

Iter 1/10 - Loss: nan   lengthscale: 0.693   noise: 0.693
Iter 2/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 3/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 4/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 5/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 6/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 7/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 8/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 9/10 - Loss: nan   lengthscale: nan   noise: nan
Iter 10/10 - Loss: nan   lengthscale: nan   noise: nan

Expected Behavior

This is my query: is the behaviour above now expected, or is it a bug? I would have thought it’s best to have consistency between the CPU and GPU. The difference appears to arise from torch itself: with version 1.8.1 one gets the NanError in both cases, but with 1.9.0 one gets the behaviour above. Related to https://github.com/pytorch/pytorch/issues/1810?
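
Until the behaviour is consistent upstream, one way to fail fast on both devices is to check the loss explicitly inside the training loop. This is a minimal sketch of the loop above with a hypothetical NaN guard added, not an official GPyTorch recommendation:

for i in range(training_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    # Raise on both CPU and GPU instead of silently stepping on a NaN loss
    if torch.isnan(loss).any():
        raise RuntimeError(f"NaN loss at iteration {i + 1}: check the training targets and kernel matrix")
    loss.backward()
    optimizer.step()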

System information

GPyTorch version: 1.5.1
PyTorch version: 1.9.0 (+cu102)
OS: Google Colab notebook

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 8

Top GitHub Comments

2 reactions
npbaskerville commented, Sep 10, 2021

Using 1.9.0 from PyPI.

2 reactions
Balandat commented, Sep 9, 2021

Hmm, so @wjmaddox the code path above is actually going through torch.linalg.cholesky_ex via psd_safe_cholesky. It looks like on recent torch versions cholesky_ex doesn’t properly surface the info code?

>> torch.linalg.cholesky_ex(resp)

torch.return_types.linalg_cholesky_ex(
L=tensor([[nan, 0., 0.,  ..., 0., 0., 0.],
        [nan, nan, 0.,  ..., 0., 0., 0.],
        [nan, nan, nan,  ..., 0., 0., 0.],
        ...,
        [nan, nan, nan,  ..., nan, 0., 0.],
        [nan, nan, nan,  ..., nan, nan, 0.],
        [nan, nan, nan,  ..., nan, nan, nan]]),
info=tensor(1, dtype=torch.int32))

>> torch.linalg.cholesky_ex(resp.to(torch.device("cuda")))

torch.return_types.linalg_cholesky_ex(
L=tensor([[nan, 0., 0.,  ..., 0., 0., 0.],
        [nan, nan, 0.,  ..., 0., 0., 0.],
        [nan, nan, nan,  ..., 0., 0., 0.],
        ...,
        [nan, nan, nan,  ..., nan, 0., 0.],
        [nan, nan, nan,  ..., nan, nan, 0.],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0'),
info=tensor(0, device='cuda:0', dtype=torch.int32))

@mruberry any thoughts why this may happen?
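
For anyone hitting this, a possible interim workaround is to validate the result of cholesky_ex explicitly. This is a minimal sketch with a hypothetical checked_cholesky helper; given the info-code issue above, the NaN fallback check is what does the real work on affected CUDA builds:

def checked_cholesky(A):
    # Factorize and verify, so CPU and GPU behave the same.
    L, info = torch.linalg.cholesky_ex(A)
    # A nonzero info marks a failed factorization; on affected CUDA builds
    # info may stay 0, so also inspect L for NaNs as a fallback.
    if (info != 0).any() or torch.isnan(L).any():
        raise RuntimeError("Cholesky failed: matrix is not positive definite or contains NaNs")
    return L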

