[Feature Request] Missing data likelihoods

🚀 Feature Request

We’d like to use GPs in settings where some observations may be missing. My understanding is that, in these circumstances, missing observations do not contribute anything to the likelihood of the observation model.
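
As a sketch of that statement, assume independent Gaussian noise with variance $\sigma^2$ and entries that are missing at random. The symbols $\mathcal{O}$ (the set of observed indices), $f_i$, $\sigma^2$, and the mask $m_i$ are notation introduced here for illustration and do not come from the issue. Under those assumptions the observation model factorizes over the observed entries only:

```latex
\log p(\mathbf{y}_{\mathcal{O}} \mid \mathbf{f})
  = \sum_{i \in \mathcal{O}} \log \mathcal{N}\!\left(y_i \mid f_i, \sigma^2\right)
  = \sum_{i=1}^{n} m_i \, \log \mathcal{N}\!\left(\tilde{y}_i \mid f_i, \sigma^2\right),
\qquad
m_i = \begin{cases} 1 & i \in \mathcal{O} \\ 0 & \text{otherwise,} \end{cases}
```

where $\tilde{y}_i = y_i$ for observed entries and any finite placeholder otherwise. The second form is the one a masked likelihood would compute: every term is evaluated, and the mask zeroes out the missing ones.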

Initial Attempt

My initial attempt to write such a likelihood is as follows:

from gpytorch.likelihoods import GaussianLikelihood

class GaussianLikelihoodWithMissingObs(GaussianLikelihood):
    """Gaussian likelihood in which NaN observations contribute zero log-probability."""

    @staticmethod
    def _get_masked_obs(x):
        # Replace NaNs with a finite placeholder so the parent computation stays finite;
        # the placeholder terms are zeroed out afterwards using the mask.
        missing_idx = x.isnan()
        x_masked = x.masked_fill(missing_idx, -999.)
        return missing_idx, x_masked

    def expected_log_prob(self, target, input, *params, **kwargs):
        missing_idx, target = self._get_masked_obs(target)
        res = super().expected_log_prob(target, input, *params, **kwargs)
        # Zero the terms that belong to missing observations.
        return res * ~missing_idx

    def log_marginal(self, observations, function_dist, *params, **kwargs):
        missing_idx, observations = self._get_masked_obs(observations)
        res = super().log_marginal(observations, function_dist, *params, **kwargs)
        # Zero the terms that belong to missing observations.
        return res * ~missing_idx
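
The placeholder fill matters because multiplying by a zero mask alone does not remove a NaN: NaN times zero is still NaN. A minimal sketch of that point in plain Python, with no GPyTorch involved; `gaussian_log_prob` and `masked_log_prob` are hypothetical helpers written for this illustration and are not part of any library:

```python
import math

def gaussian_log_prob(y, mean, std):
    """Log density of N(mean, std^2) evaluated at y."""
    z = (y - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def masked_log_prob(y, mean, std, placeholder=-999.0):
    """Return zero for a missing (NaN) observation, else the Gaussian log density."""
    missing = math.isnan(y)
    # Fill first, so the density is evaluated at a finite value ...
    y_filled = placeholder if missing else y
    # ... then zero the term if the observation was missing.
    return gaussian_log_prob(y_filled, mean, std) * (0.0 if missing else 1.0)

nan = float("nan")
naive = gaussian_log_prob(nan, 0.0, 1.0) * 0.0  # mask only: the NaN survives
masked = masked_log_prob(nan, 0.0, 1.0)         # fill, then mask: the term is zero
```

The same ordering (fill, evaluate, mask) is what the class above applies to whole tensors.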

Test

import torch
from tqdm import trange
from gpytorch.distributions import MultivariateNormal
from gpytorch.constraints import Interval
torch.manual_seed(42)

# Two batches of a nearly degenerate 3-d Gaussian (correlations close to +/-1).
mu = torch.zeros(2, 3)
sigma = torch.tensor([[
        [ 1,  1-1e-7, -1+1e-7],
        [ 1-1e-7,  1, -1+1e-7],
        [-1+1e-7, -1+1e-7,  1] ]]*2).float()
mvn = MultivariateNormal(mu, sigma)
x = mvn.sample(torch.Size([10000]))
# Uncomment to mark roughly 10% of the entries as missing:
# x[torch.rand_like(x) < 0.1] = float("nan")
x += 0.5 * torch.randn_like(x)  # observation noise with standard deviation 0.5

# Switch likelihoods by deleting the '#' on the next line.
LikelihoodOfChoice = GaussianLikelihood#WithMissingObs
likelihood = LikelihoodOfChoice(noise_constraint=Interval(1e-6, 2))

opt = torch.optim.Adam(likelihood.parameters(), lr=0.5)

bar = trange(1000)
for _ in bar:
    opt.zero_grad()
    loss = -likelihood.log_marginal(x, mvn).sum()
    loss.backward()
    opt.step()
    bar.set_description(f"nll: {int(loss.item())}")
print(likelihood.noise.sqrt()) # Test 1

print(likelihood.expected_log_prob(x[0], mvn) == likelihood.log_marginal(x[0], mvn)) # Test 2

Test 1 outputs the expected noise standard deviation of 0.5, and Test 2 is False, with both LikelihoodOfChoice = GaussianLikelihood and LikelihoodOfChoice = GaussianLikelihoodWithMissingObs.

Any further tests and suggestions are appreciated. Can I open a PR for this?

Issue Analytics

Created: 2 years ago
Reactions: 1
Comments: 17 (12 by maintainers)

Top GitHub Comments

adamjstewart commented, Jul 28, 2022 (1 reaction)

I am also interested in a MultitaskGaussianLikelihoodWithMissingObs for the same reasons listed above (not all data points contain observations for all tasks). Will try to read through the discussion above to get up to speed. @mochar you mentioned that you may have gotten this working, did you open a PR with your code?

One implementation detail I’m not sure about: in the case of each task having a different percentage of missing observations, should we normalize the loss by dividing each task by the total number of observations? It seems like each task should equally contribute to the overall loss.
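
One reading of that normalization, sketched in plain Python: average each task's log-probabilities over the observations that task actually has, so a sparsely observed task is not outweighed by a densely observed one. `per_task_sum` and `per_task_mean` are hypothetical helpers for illustration, not GPyTorch functions. Whether to normalize at all remains the open question; the plain sum is the exact joint log-likelihood under independent noise, and the per-task mean is a reweighted objective.

```python
import math

def per_task_sum(log_probs):
    """Sum the observed (non-NaN) log-probabilities of each task.

    log_probs[n][t] is the log-probability of observation n for task t,
    or NaN where that observation is missing.
    """
    num_tasks = len(log_probs[0])
    return [
        sum(row[t] for row in log_probs if not math.isnan(row[t]))
        for t in range(num_tasks)
    ]

def per_task_mean(log_probs):
    """Average each task's log-probabilities over its own observed entries."""
    num_tasks = len(log_probs[0])
    means = []
    for t in range(num_tasks):
        observed = [row[t] for row in log_probs if not math.isnan(row[t])]
        # A task with no observations at all contributes nothing.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means

nan = float("nan")
# Three data points, two tasks; task 1 is observed only once.
log_probs = [
    [-1.0, -2.0],
    [-3.0, nan],
    [-2.0, nan],
]
task_sums = per_task_sum(log_probs)    # task 0 dominates the unnormalized loss
task_means = per_task_mean(log_probs)  # both tasks are on the same scale
```

In this toy input the unnormalized sums differ by a factor of three purely because task 0 has three observations and task 1 has one, while the per-task means coincide.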

InfProbSciX commented, Aug 31, 2021 (1 reaction)

@mochar My apologies, I haven’t been receiving notifications - I’ll have a look and get back to you.
