[Feature Request] Missing data likelihoods

🚀 Feature Request

We'd like to use GPs in settings where some observations may be missing. My understanding is that, in these circumstances, missing observations do not contribute anything to the likelihood of the observation model.
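Under the common assumption that the observation noise is independent across data points, the claim above can be written down directly. In the sketch below, m_i is a hypothetical indicator (not notation from the issue) that equals 1 when y_i is observed and 0 when it is missing, f_i is the latent function value, and \sigma^2 is the noise variance:

```latex
\log p(y_{\text{obs}} \mid f)
  = \sum_{i=1}^{n} m_i \, \log \mathcal{N}\!\left(y_i \mid f_i, \sigma^2\right),
\qquad
m_i =
\begin{cases}
  1 & \text{if } y_i \text{ is observed}, \\
  0 & \text{if } y_i \text{ is missing}.
\end{cases}
```

Each missing y_i integrates out of the product of independent terms with a factor of 1, so it adds 0 to the log-likelihood. The masking in the attempt below aims for exactly this.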

Initial Attempt

My initial attempt to write such a likelihood is as follows:

from gpytorch.likelihoods import GaussianLikelihood
from torch.distributions import Normal

class GaussianLikelihoodWithMissingObs(GaussianLikelihood):
    """Gaussian likelihood that treats NaN observations as missing."""

    @staticmethod
    def _get_masked_obs(x):
        # NaN marks a missing observation. Fill it with a finite placeholder so
        # the parent computation stays finite; the value is masked out later.
        missing_idx = x.isnan()
        x_masked = x.masked_fill(missing_idx, -999.)
        return missing_idx, x_masked

    def expected_log_prob(self, target, input, *params, **kwargs):
        missing_idx, target = self._get_masked_obs(target)
        res = super().expected_log_prob(target, input, *params, **kwargs)
        # Zero out the terms that belong to missing observations.
        return res * ~missing_idx

    def log_marginal(self, observations, function_dist, *params, **kwargs):
        missing_idx, observations = self._get_masked_obs(observations)
        res = super().log_marginal(observations, function_dist, *params, **kwargs)
        # Zero out the terms that belong to missing observations.
        return res * ~missing_idx
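The fill-then-mask pattern above can be illustrated without torch or gpytorch. The following is a minimal plain-Python sketch, assuming independent Gaussian terms; `normal_log_pdf` and `masked_log_probs` are hypothetical helpers written for this illustration, not GPyTorch functions. It shows that the placeholder value has no effect on the result, and why the NaN has to be replaced before the mask is applied.

```python
import math

def normal_log_pdf(y, mean, std):
    """Log density of a univariate Normal(mean, std**2) at y."""
    return -0.5 * math.log(2 * math.pi * std**2) - (y - mean) ** 2 / (2 * std**2)

def masked_log_probs(ys, mean, std, placeholder=-999.0):
    """Per-observation log-probabilities with missing (NaN) entries zeroed out."""
    missing = [math.isnan(y) for y in ys]
    # Replace NaN with a finite placeholder first, mirroring masked_fill above.
    filled = [placeholder if m else y for y, m in zip(ys, missing)]
    log_probs = [normal_log_pdf(y, mean, std) for y in filled]
    # Multiply by the "observed" mask so placeholder terms contribute zero.
    return [lp * (not m) for lp, m in zip(log_probs, missing)]

ys = [0.3, float("nan"), -1.2]
a = masked_log_probs(ys, mean=0.0, std=0.5, placeholder=-999.0)
b = masked_log_probs(ys, mean=0.0, std=0.5, placeholder=0.0)

print(a == b)                         # compares the results for two placeholders
print(math.isnan(float("nan") * 0))   # checks whether NaN survives multiplication by 0
```

Because NaN times zero is still NaN, multiplying the unfilled log-probabilities by the mask would not remove the missing entries, which is presumably why the attempt fills them first.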

Test

import torch
import numpy as np
from tqdm import trange
from gpytorch.distributions import MultivariateNormal
from gpytorch.constraints import Interval
torch.manual_seed(42)

mu = torch.zeros(2, 3)
sigma = torch.tensor([[
        [ 1,  1-1e-7, -1+1e-7],
        [ 1-1e-7,  1, -1+1e-7],
        [-1+1e-7, -1+1e-7,  1] ]]*2).float()
mvn = MultivariateNormal(mu, sigma)
x = mvn.sample(torch.Size([10000]))  # sample_n is deprecated in torch.distributions
# x[np.random.binomial(1, 0.1, size=x.shape).astype(bool)] = np.nan
x += 0.5 * torch.randn_like(x)  # add observation noise with standard deviation 0.5

LikelihoodOfChoice = GaussianLikelihood#WithMissingObs
likelihood = LikelihoodOfChoice(noise_constraint=Interval(1e-6, 2))

opt = torch.optim.Adam(likelihood.parameters(), lr=0.5)

bar = trange(1000)
for _ in bar:
    opt.zero_grad()
    loss = -likelihood.log_marginal(x, mvn).sum()
    loss.backward()
    opt.step()
    bar.set_description(f"nll: {int(loss.item())}")
print(likelihood.noise.sqrt()) # Test 1

likelihood.expected_log_prob(x[0], mvn) == likelihood.log_marginal(x[0], mvn) # Test 2

Test 1 outputs 0.5, the standard deviation of the added noise, as expected. Test 2 is False with both LikelihoodOfChoice = GaussianLikelihood and LikelihoodOfChoice = GaussianLikelihoodWithMissingObs.
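One plausible explanation for the Test 2 result, offered as an assumption rather than something confirmed in the issue, is that the two methods compute different quantities. An expected log-probability E[log p(y | f)] and a log marginal log E[p(y | f)] differ whenever f has nonzero variance, in line with Jensen's inequality. The plain-Python sketch below uses the standard closed forms for a single observation with f ~ N(mu, v) and noise variance s2. The function names are hypothetical and nothing here calls gpytorch.

```python
import math

def expected_log_prob(y, mu, v, s2):
    """E over f ~ N(mu, v) of log N(y | f, s2), in closed form."""
    return -0.5 * (math.log(2 * math.pi * s2) + ((y - mu) ** 2 + v) / s2)

def log_marginal(y, mu, v, s2):
    """Log of the marginal density of y, which is N(y | mu, v + s2)."""
    total = v + s2
    return -0.5 * (math.log(2 * math.pi * total) + (y - mu) ** 2 / total)

y, mu, v, s2 = 0.4, 0.0, 1.0, 0.25
elp = expected_log_prob(y, mu, v, s2)
lm = log_marginal(y, mu, v, s2)
print(elp, lm)     # the two quantities for the same inputs
print(elp <= lm)   # Jensen's inequality: E[log p] <= log E[p]

# With zero function variance the two expressions reduce to the same formula.
same = math.isclose(expected_log_prob(y, mu, 0.0, s2), log_marginal(y, mu, 0.0, s2))
print(same)
```

If this reading is right, Test 2 returning False for both likelihood classes would not by itself indicate a problem with the masking.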

Any further tests and suggestions are appreciated. Can I open a PR for this?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 17 (12 by maintainers)

Top GitHub Comments

1 reaction
adamjstewart commented, Jul 28, 2022

I am also interested in a MultitaskGaussianLikelihoodWithMissingObs for the same reasons listed above (not all data points contain observations for all tasks). Will try to read through the discussion above to get up to speed. @mochar you mentioned that you may have gotten this working, did you open a PR with your code?

One implementation detail I'm not sure about: in the case of each task having a different percentage of missing observations, should we normalize the loss by dividing each task by the total number of observations? It seems like each task should equally contribute to the overall loss.
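One way to read the normalization question, sketched under assumptions: average each task's log-probabilities over that task's own observed count, then sum across tasks, so that a task with many missing values is not down-weighted. This is an illustration in plain Python, not a scheme confirmed by the maintainers. `per_task_normalized_loss` and the input layout (rows are data points, columns are tasks, NaN marks a missing entry) are hypothetical.

```python
import math

def per_task_normalized_loss(log_probs):
    """Negative log-likelihood where each task is averaged over its observed entries.

    log_probs[i][t] is the log-probability of data point i for task t,
    or NaN when that observation is missing.
    """
    num_tasks = len(log_probs[0])
    loss = 0.0
    for t in range(num_tasks):
        observed = [row[t] for row in log_probs if not math.isnan(row[t])]
        if observed:  # skip tasks that have no observations at all
            loss -= sum(observed) / len(observed)
    return loss

nan = float("nan")
log_probs = [
    [-1.0, -2.0],  # both tasks observed
    [-3.0, nan],   # second task missing
    [-2.0, nan],   # second task missing
]
print(per_task_normalized_loss(log_probs))
```

With a plain sum over observed entries, the first task here would dominate because it has three observations to the second task's one. Averaging per task gives both the same weight, which matches the intent stated in the comment.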

1 reaction
InfProbSciX commented, Aug 31, 2021

@mochar My apologies, I haven't been receiving notifications - I'll have a look and get back to you.
