[Bug] Different nan handling under GPU and CPU
🐛 Bug
There are cases in which code run on the CPU throws a NanError,
while the same code run on the GPU raises no error but silently produces NaNs, e.g. in the training loss.
To reproduce
import math
import gpytorch
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

# Training data is 100 points in [0,1] inclusive regularly spaced
train_x = torch.linspace(0, 1, 100).to(device)
# True function is sin(2*pi*x) with Gaussian noise
train_y = torch.sin(train_x * (2 * math.pi)) + torch.randn(train_x.size()).to(device) * math.sqrt(0.04)
# Intentionally corrupt the train_y to give nans
train_y = train_y.log()
print(train_y)

class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

# initialize likelihood and model
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood).to(device)

training_iter = 10

# Find optimal model hyperparameters
model.train()
likelihood.train()

# Use the adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  # Includes GaussianLikelihood parameters

# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for i in range(training_iter):
    # Zero gradients from previous iteration
    optimizer.zero_grad()
    # Output from model
    output = model(train_x)
    # Calc loss and backprop gradients
    loss = -mll(output, train_y)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f lengthscale: %.3f noise: %.3f' % (
        i + 1, training_iter, loss.item(),
        model.covar_module.base_kernel.lengthscale.item(),
        model.likelihood.noise.item()
    ))
    optimizer.step()
On CPU
Iter 1/10 - Loss: nan lengthscale: 0.693 noise: 0.693
---------------------------------------------------------------------------
NanError Traceback (most recent call last)
<ipython-input-17-ae7f611b728d> in <module>()
17 output = model(train_x)
18 # Calc loss and backprop gradients
---> 19 loss = -mll(output, train_y)
20 loss.backward()
21 print('Iter %d/%d - Loss: %.3f lengthscale: %.3f noise: %.3f' % (
8 frames
/usr/local/lib/python3.7/dist-packages/gpytorch/utils/cholesky.py in _psd_safe_cholesky(A, out, jitter, max_tries)
29 if isnan.any():
30 raise NanError(
---> 31 f"cholesky_cpu: {isnan.sum().item()} of {A.numel()} elements of the {A.shape} tensor are NaN."
32 )
33
NanError: cholesky_cpu: 10000 of 10000 elements of the torch.Size([100, 100]) tensor are NaN.
but on GPU:
Iter 1/10 - Loss: nan lengthscale: 0.693 noise: 0.693
Iter 2/10 - Loss: nan lengthscale: nan noise: nan
Iter 3/10 - Loss: nan lengthscale: nan noise: nan
Iter 4/10 - Loss: nan lengthscale: nan noise: nan
Iter 5/10 - Loss: nan lengthscale: nan noise: nan
Iter 6/10 - Loss: nan lengthscale: nan noise: nan
Iter 7/10 - Loss: nan lengthscale: nan noise: nan
Iter 8/10 - Loss: nan lengthscale: nan noise: nan
Iter 9/10 - Loss: nan lengthscale: nan noise: nan
Iter 10/10 - Loss: nan lengthscale: nan noise: nan
Expected Behavior
This is my query: is the behaviour above now expected, or is it a bug? I would have thought it best to have consistency between CPU and GPU. The difference appears to arise from torch itself: with version 1.8.1 one gets the NanError
in both cases, but with 1.9.0 one gets the behaviour above. Related to https://github.com/pytorch/pytorch/issues/1810?
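To make the suspected source of the divergence concrete, here is a minimal sketch (assuming the PyTorch >= 1.9 torch.linalg.cholesky_ex API) that probes both backends directly with a fully-NaN 100x100 matrix, like the one in the traceback above, independent of GPyTorch:

import torch

# A fully-NaN matrix, mimicking the corrupted kernel matrix from the traceback.
A = torch.full((100, 100), float("nan"))

devices = ["cpu"] + (["cuda:0"] if torch.cuda.is_available() else [])
for device in devices:
    L, info = torch.linalg.cholesky_ex(A.to(device))
    # info == 0 indicates success; a positive value reports the leading minor
    # that is not positive definite. Whether a NaN input is flagged through
    # `info`, or only shows up as NaNs in L, may differ between backends.
    print(device, "info:", info.item(), "NaN in L:", torch.isnan(L).any().item())

If info comes back as zero on the GPU while the CPU path reports a failure, that would match the training behaviour shown above.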
System information
GPyTorch version: 1.5.1
PyTorch version: 1.9.0 (+cu102)
OS: Google Colab notebook
Top GitHub Comments
Using 1.9.0 from PyPI.
Hmm, so @wjmaddox the code path above is actually going through torch.linalg.cholesky_ex via psd_safe_cholesky. It looks like on recent torch versions cholesky_ex doesn't properly surface the info code? @mruberry any thoughts why this may happen?
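For context on that last comment, here is a hedged sketch of the kind of check being discussed; cholesky_or_raise is a hypothetical helper, not GPyTorch's actual psd_safe_cholesky. It inspects both the input and the info code returned by torch.linalg.cholesky_ex, so a failed factorization raises on every device instead of silently returning a NaN factor.

import torch

def cholesky_or_raise(A: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper, not GPyTorch's implementation: make Cholesky
    # failures explicit on CPU and GPU alike rather than propagating NaNs.
    if torch.isnan(A).any():
        raise ValueError(
            f"{torch.isnan(A).sum().item()} of {A.numel()} elements of the "
            f"{A.shape} tensor are NaN."
        )
    L, info = torch.linalg.cholesky_ex(A)
    if (info != 0).any():
        raise ValueError(f"Cholesky factorization failed with info={info.tolist()}.")
    return L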