NotPSDError: Matrix not positive definite after repeatedly adding jitter up to 1.0e-06 when running on GPU
I get the following error: NotPSDError: Matrix not positive definite after repeatedly adding jitter up to 1.0e-06. The code below works when running on CPU, but not when I switch to GPU. Why does it only work on CPU and not on GPU?
I am training a deep GP (DGP) for regression using PyTorch Lightning, constructed as shown below. The input dimension to the first DGP layer is 256:
import torch
import gpytorch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from pytorch_lightning.loggers import TensorBoardLogger
from gpytorch.mlls import DeepApproximateMLL, VariationalELBO

# DeepGP (the deep GP model) and load_dataset() are defined elsewhere by the
# poster and are not shown in the issue.

class PL_model(pl.LightningModule):
    def __init__(self, batch_size, lr, betas, num_samples, num_output_dims):
        super().__init__()
        # Training parameters
        self.batch_size = batch_size
        self.lr = lr
        self.betas = betas
        self.num_samples = num_samples
        self.num_output_dims = num_output_dims
        self.gpmodel = DeepGP(256, self.num_output_dims)
        self.mll = DeepApproximateMLL(
            VariationalELBO(self.gpmodel.likelihood, self.gpmodel, self.batch_size)
        )

    def forward(self, x):
        # Compute the predictive distribution
        return self.gpmodel(x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr, betas=self.betas, weight_decay=1e-3)

    def training_step(self, batch, batch_idx):
        x, y = batch
        with gpytorch.settings.num_likelihood_samples(self.num_samples):
            output = self(x)
            loss = -self.mll(output, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        with torch.no_grad():
            lls = self.gpmodel.likelihood.log_marginal(y, self(x))
        return -lls

    def setup(self, stage=None):
        dataset = load_dataset()
        train_split = int(0.8 * len(dataset))
        val_split = len(dataset) - train_split
        self.train_set, self.val_set = random_split(dataset, [train_split, val_split])

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size, shuffle=False)


model = PL_model(
    batch_size=32,
    lr=0.1,
    betas=(0.85, 0.89),
    num_samples=3,
    num_output_dims=2,
)
trainer = pl.Trainer(
    min_epochs=5,
    max_epochs=8,
    gpus=1,
    logger=TensorBoardLogger("lightning_logs/", name="DGP"),
)
trainer.fit(model)
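For reference, the DeepGP class and load_dataset() are not included in the issue. Below is a minimal sketch of what a two-layer deep GP of this shape might look like, following the standard GPyTorch deep GP pattern; the layer names, number of inducing points, and mean types are assumptions, not the poster's actual definition:

import torch
from gpytorch.distributions import MultivariateNormal
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.likelihoods import GaussianLikelihood
from gpytorch.means import ConstantMean, LinearMean
from gpytorch.models.deep_gps import DeepGP as BaseDeepGP, DeepGPLayer
from gpytorch.variational import CholeskyVariationalDistribution, VariationalStrategy


class DGPHiddenLayer(DeepGPLayer):
    # One variational GP layer; output_dims=None gives a single-output final layer.
    def __init__(self, input_dims, output_dims, num_inducing=128, mean_type="constant"):
        if output_dims is None:
            inducing_points = torch.randn(num_inducing, input_dims)
            batch_shape = torch.Size([])
        else:
            inducing_points = torch.randn(output_dims, num_inducing, input_dims)
            batch_shape = torch.Size([output_dims])

        variational_distribution = CholeskyVariationalDistribution(
            num_inducing_points=num_inducing, batch_shape=batch_shape
        )
        variational_strategy = VariationalStrategy(
            self, inducing_points, variational_distribution, learn_inducing_locations=True
        )
        super().__init__(variational_strategy, input_dims, output_dims)

        if mean_type == "linear":
            self.mean_module = LinearMean(input_dims)
        else:
            self.mean_module = ConstantMean(batch_shape=batch_shape)
        self.covar_module = ScaleKernel(
            RBFKernel(batch_shape=batch_shape, ard_num_dims=input_dims),
            batch_shape=batch_shape,
        )

    def forward(self, x):
        return MultivariateNormal(self.mean_module(x), self.covar_module(x))


class DeepGP(BaseDeepGP):
    # Two-layer DGP: 256 inputs -> num_output_dims hidden GP outputs -> scalar output.
    def __init__(self, input_dims, num_output_dims):
        super().__init__()
        self.hidden_layer = DGPHiddenLayer(input_dims, num_output_dims, mean_type="linear")
        self.last_layer = DGPHiddenLayer(num_output_dims, None, mean_type="constant")
        self.likelihood = GaussianLikelihood()

    def forward(self, inputs):
        return self.last_layer(self.hidden_layer(inputs))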
Issue Analytics: created a year ago, 5 comments (2 by maintainers).
Top GitHub Comments
@mbelalsh it happens - it's a known stability issue with Gaussian processes. It is a property of your data, combined with the fact that all computations are done in single precision.
Try switching to double precision, or using smaller learning rates on your optimizer.
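As an illustration of that suggestion (not from the original thread), the Lightning run above could be switched to double precision via the Trainer's precision flag and given a smaller learning rate; the specific values here are assumptions:

# Assumption: reuse the PL_model class from the question, but train in float64
# and with a smaller learning rate, as suggested above.
model = PL_model(
    batch_size=32,
    lr=0.01,            # smaller than the original 0.1 (illustrative value)
    betas=(0.85, 0.89),
    num_samples=3,
    num_output_dims=2,
)
trainer = pl.Trainer(
    min_epochs=5,
    max_epochs=8,
    gpus=1,
    precision=64,       # train in double precision (float64)
    logger=TensorBoardLogger("lightning_logs/", name="DGP"),
)
trainer.fit(model)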
@gpleiss I got the error while using VNNGP at prediction time. My data is already normalized.