[Bug] In version 0.3.6, the KISS-GP example cannot be put on CUDA
🐛 Bug
Hi, I installed the GPyTorch alpha version directly (compatible with my PyTorch version, see below) and started with this tutorial for KISS-GP.
However, upon running the code on the GPU (with only 1 GPU; I'm on a laptop), I encounter the error below.
To reproduce
Code snippet to reproduce
import torch as th
import gpytorch as gpt

# GPRegressionModel is the SKI model shown below under "Expected Behavior";
# x_tr, y_tr are the training tensors (assumed to already be on 'cuda:0')
ll = gpt.likelihoods.GaussianLikelihood().to('cuda:0')
m = GPRegressionModel(x_tr, y_tr, ll)
m = m.to('cuda:0')

# Find optimal model hyperparameters
m.train()
ll.train()

# Use the adam optimizer
opt = th.optim.Adam(
    [{'params': m.parameters()}],  # Includes GaussianLikelihood parameters
    lr=0.1,
)

# "Loss" for GPs - the marginal log likelihood
mll = gpt.mlls.ExactMarginalLogLikelihood(ll, m)

training_iterations = 30
for i in range(training_iterations):
    opt.zero_grad()
    output = m(x_tr)
    print(x_tr.device, output, y_tr.device)
    loss = -mll(output, y_tr)
    loss.backward()
    print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))
    opt.step()
Stack trace/error message
~/SJTU/research_code/TCEP/GP_scoring/gpytorch_local/gpytorch/kernels/rbf_kernel.py in forward(self, x1, x2, diag, **params)
80 x2,
81 self.lengthscale,
---> 82 lambda x1, x2: self.covar_dist(
83 x1, x2, square_dist=True, diag=False, dist_postprocess_func=postprocess_rbf, postprocess=False, **params
84 ),
~/SJTU/research_code/TCEP/GP_scoring/gpytorch_local/gpytorch/functions/rbf_covariance.py in forward(ctx, x1, x2, lengthscale, sq_dist_func)
10 raise ValueError("RBFCovariance cannot handle multiple lengthscales")
11 needs_grad = any(ctx.needs_input_grad)
---> 12 x1_ = x1.div(lengthscale)
13 x2_ = x2.div(lengthscale)
14 unitless_sq_dist = sq_dist_func(x1_, x2_)
RuntimeError: expected device cpu and dtype Float but got device cuda:0 and dtype Float
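A quick way to see which tensors actually moved (just a diagnostic sketch, assuming m is the GPRegressionModel instance built above) is to list every registered parameter and buffer together with the device it lives on; anything that is neither a parameter nor a buffer will not be moved by .to() at all.

# Print the device of every parameter and buffer registered on the model
for name, param in m.named_parameters():
    print('param ', name, param.device)
for name, buf in m.named_buffers():
    print('buffer', name, buf.device)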
Expected Behavior
I expected all of the model parameters to be on the GPU; however, the basic model you give is:
class GPRegressionModel(gpt.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)

        # SKI requires a grid size hyperparameter. This util can help with that.
        # Here we are using a grid that has the same number of points as the
        # training data (a ratio of 1.0). Performance can be sensitive to this
        # parameter, so you may want to adjust it for your own problem on a
        # validation set.
        grid_size = gpt.utils.grid.choose_grid_size(train_x, 1.0)

        self.mean_module = gpt.means.ConstantMean()
        self.covar_module = gpt.kernels.GridInterpolationKernel(
            gpt.kernels.ScaleKernel(gpt.kernels.RBFKernel()),
            grid_size=grid_size, num_dims=1,
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpt.distributions.MultivariateNormal(mean_x, covar_x)
which returns a gpt.distributions.MultivariateNormal that cannot be put on the GPU. This is solved when I clone the code and modify RBFCovariance as follows:
class RBFCovariance(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x1, x2, lengthscale, sq_dist_func):
        if any(ctx.needs_input_grad[:2]):
            raise RuntimeError("RBFCovariance cannot compute gradients with respect to x1 and x2")
        if lengthscale.size(-1) > 1:
            raise ValueError("RBFCovariance cannot handle multiple lengthscales")
        needs_grad = any(ctx.needs_input_grad)
        x1_ = x1.to('cuda:0').div(lengthscale)
        x2_ = x2.to('cuda:0').div(lengthscale)
        unitless_sq_dist = sq_dist_func(x1_, x2_)
        # clone because inplace operations will mess with what's saved for backward
        unitless_sq_dist_ = unitless_sq_dist.clone() if needs_grad else unitless_sq_dist
        covar_mat = unitless_sq_dist_.div_(-2.0).exp_()
        if needs_grad:
            d_output_d_input = unitless_sq_dist.mul_(covar_mat).div_(lengthscale)
            ctx.save_for_backward(d_output_d_input)
        return covar_mat
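As an aside, hard-coding 'cuda:0' only works for this particular setup; a device-agnostic form of the same local workaround (only a sketch of those two lines, not a proper fix) would move the inputs onto whatever device the lengthscale already lives on:

# same hack, but without assuming a specific GPU
x1_ = x1.to(lengthscale.device).div(lengthscale)
x2_ = x2.to(lengthscale.device).div(lengthscale)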
BUT even with that change, the interpolate method is not put on the GPU either:
~/SJTU/research_code/TCEP/GP_scoring/gpytorch_local/gpytorch/utils/interpolation.py in interpolate(self, x_grid, x_target, interp_points, eps)
112
113 # get the interp. coeff. based on distances to interpolating points
--> 114 scaled_dist = lower_pt_rel_dists.unsqueeze(-1) + interp_points_flip.unsqueeze(-2)
115 dim_interp_values = self._cubic_interpolation_kernel(scaled_dist)
116
RuntimeError: expected device cuda:0 and dtype Float but got device cpu and dtype Float
System information
Please complete the following information:
- GPyTorch version: 0.3.6
- PyTorch version: 1.2.0
- OS: Ubuntu 18.04
Additional context
Top GitHub Comments
Actually, I’ve seen a similar bug myself as well. It might have been from when I redid the GridKernel and GridInterpolationKernel. I’ll look into this.
Use this branch https://github.com/cornellius-gp/gpytorch/pull/983 for now, which fixes the issue. It’ll be merged in soon.
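For anyone landing here later, a minimal smoke test (a sketch that assumes the GPRegressionModel class from this issue and uses dummy 1-D data in place of x_tr / y_tr) to check that the forward/backward pass runs on the GPU without a device-mismatch error:

import torch as th
import gpytorch as gpt

device = 'cuda:0' if th.cuda.is_available() else 'cpu'

# dummy training data standing in for x_tr / y_tr
x_tr = th.linspace(0, 1, 100, device=device)
y_tr = th.sin(x_tr * 6.283) + 0.1 * th.randn_like(x_tr)

ll = gpt.likelihoods.GaussianLikelihood().to(device)
m = GPRegressionModel(x_tr, y_tr, ll).to(device)
m.train()
ll.train()

mll = gpt.mlls.ExactMarginalLogLikelihood(ll, m)
loss = -mll(m(x_tr), y_tr)  # should no longer raise a device-mismatch RuntimeError
loss.backward()
print('OK - loss =', loss.item())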