Add a way to plot credible intervals with plot_ppc

plot_ppc is very nice: you can control the kind of plot and also the number of posterior predictive lines to draw with num_pp_samples. Would it be possible to add an option to plot credible intervals instead of single draws from the posterior predictive? Consider the following example:

import numpy as np
import arviz as az
import pymc3 as pm
from matplotlib import pyplot as plt

_x = np.random.uniform(-5, 5, 100)
_m = 1.5
_b = -0.7
_obs = np.random.normal(_x * _m + _b, 1)

with pm.Model():
    x = pm.Data("x", _x)
    m = pm.Normal("m", 0, 5)
    b = pm.Normal("b", 0, 5)
    obs = pm.Normal("obs", x * m + b, 1, observed=_obs)

    idata = pm.sample(return_inferencedata=True)
    ppc = pm.sample_posterior_predictive(idata)
    idata.extend(az.from_pymc3(posterior_predictive=ppc))

az.plot_ppc(idata);

The resulting plot is something like this

All of the individual lines from the posterior predictive samples are quite hard to read, and it’s hard to make sense of how likely it is to find a sample in a given interval.

I would like to plot something like this:

from scipy import stats  # needed for stats.gaussian_kde below

ax = az.plot_dist(idata.observed_data.to_array(), color="black", label="Observed obs")
grid = np.linspace(*ax.get_xlim(), 1000)

# Get the ppc HDI
lines = []
for line in idata.posterior_predictive.to_array().values.reshape([-1, len(_obs)]):
    lines.append(stats.gaussian_kde(line)(grid))
lines = np.array(lines)
pdf = np.mean(lines, axis=0)
hdi = az.hdi(lines, hdi_prob=0.95)
az.plot_hdi(grid, hdi_data=hdi, color="C0", ax=ax, fill_kwargs={"label": "95% HDI"})
ax.plot(grid, pdf, color="C0", linestyle="--", label="Posterior predictive mean obs")
ax.legend();

where the filled-in area is the posterior predictive’s HDI at a certain level.

Issue Analytics

State:
Created 2 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

OriolAbril commented, Apr 10, 2021

Yes, I am testing that the whole kde line is inside the hdi shaded area, which is what I believe should be considered and the only clear and interpretable diagnostic. Again, in my opinion the ideal solution to this is using https://arxiv.org/abs/2103.10522 though, not spaghetti plots nor the hdi proposed here (nor variations on that to try and fix the hdi region to account for whole lines).

I do believe it could be useful to add this, but we have to be careful because it is not at all clear what exactly the hdi shaded area represents, nor how to interpret lines going outside that region.

My intuitive understanding of what an HDI for a PPC looks like is that if I have a 94% HDI, 94% of my KDE should fall inside the HDI on average if my model is solid.

I don’t think that’s true either and I am also quite sure about this. kdes are continuous lines, so the probability of the 2nd point being outside the region given the 1st one was outside is different than if the 1st one was inside; they are not independent values. Yet, we are calculating the hdi as if they were. Moreover, even if the hdi had the interpretation you mention, I don’t think having users estimate by themselves (and visually) whether 95% or 90% of the kde line is outside the hdi region is a good idea.
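The dependence argument above can be checked with a small simulation. This is only a sketch under stated assumptions, not what ArviZ does: it uses a hand-rolled Gaussian KDE with a fixed bandwidth (instead of scipy's gaussian_kde) and an equal-tailed quantile band (instead of az.hdi), and every curve comes from the same standard normal, so the "model" is correct by construction. The function names are made up for illustration.

```python
import numpy as np


def gaussian_kde_on_grid(sample, grid, bandwidth):
    """Evaluate a fixed-bandwidth Gaussian KDE of `sample` at each grid point."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    norm = sample.size * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def coverage_of_pointwise_band(n_curves=400, n_obs=100, prob=0.95, seed=0):
    """Compare pointwise coverage with whole-curve coverage of a pointwise band."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-3, 3, 50)
    # Every curve is the KDE of a fresh sample from the SAME distribution.
    curves = np.array(
        [gaussian_kde_on_grid(rng.normal(size=n_obs), grid, 0.4) for _ in range(n_curves)]
    )
    # Band computed independently at each grid point (assumption: equal-tailed, not HDI).
    lower, upper = np.quantile(curves, [(1 - prob) / 2, (1 + prob) / 2], axis=0)
    inside = (curves >= lower) & (curves <= upper)
    pointwise = inside.mean()              # share of (curve, grid point) pairs inside the band
    whole_curve = inside.all(axis=1).mean()  # share of curves inside the band everywhere
    return pointwise, whole_curve


pointwise, whole_curve = coverage_of_pointwise_band()
print(f"pointwise coverage: {pointwise:.3f}")
print(f"curves entirely inside the band: {whole_curve:.3f}")
```

The pointwise coverage sits at the nominal level by construction, while the share of curves that stay inside the band over the whole grid is noticeably lower, even though all curves come from the same distribution.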

As a side note on spaghetti plots, interpreting with an animation could be useful. Imagine the following situation. You start the animation with a spaghetti plot with 100 kde lines from the posterior predictive, then 10 more lines are added to the plot one by one: 9 come from the posterior predictive too and one is the one corresponding to the observations. You have to try and guess which is the kde of the observed data. Once the 10 lines have been added, the plot is updated to highlight the observed kde. If you guessed which one it is (and if in general anyone can guess), then your model is not reproducing the generative process correctly; if generally it’s not possible to know which is which, your model is probably ok.
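A static version of that guessing game only needs the data preparation step: hide the observed dataset among a few posterior predictive draws and remember where it ended up. The helper below is a hypothetical sketch (build_lineup is not an ArviZ function), and the arrays used to exercise it are synthetic stand-ins for real observed and posterior predictive data.

```python
import numpy as np


def build_lineup(observed, predictive_draws, n_decoys=9, seed=None):
    """Shuffle the observed data in among `n_decoys` predictive draws.

    Returns the shuffled list of datasets and the hidden position of the
    observed one, to be revealed only after the viewer has guessed.
    """
    rng = np.random.default_rng(seed)
    # Pick distinct predictive draws to act as decoys.
    decoy_rows = rng.choice(len(predictive_draws), size=n_decoys, replace=False)
    candidates = [np.asarray(observed)] + [predictive_draws[i] for i in decoy_rows]
    # Candidate 0 is the observed data; find where the shuffle put it.
    order = rng.permutation(len(candidates))
    datasets = [candidates[i] for i in order]
    observed_position = int(np.flatnonzero(order == 0)[0])
    return datasets, observed_position


# Synthetic stand-ins: 100 observations and 500 predictive draws of the same size.
rng = np.random.default_rng(1)
observed = rng.normal(size=100)
predictive_draws = rng.normal(size=(500, 100))

datasets, position = build_lineup(observed, predictive_draws, seed=2)
```

Each entry of datasets can then be drawn as one kde line on top of the existing spaghetti plot, with position used at the end to highlight the observed one.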

OriolAbril commented, Apr 8, 2021

I think the ideal situation here would be to implement https://arxiv.org/abs/2103.10522 to compare several (or all) of the posterior predictive distributions to the observed one. The hdi of kde lines looks nice, but I don’t think there is any guarantee that kde lines of the same distribution will lie completely inside the shaded region with 95% probability.

We can definitely add the option for the kde hdi shaded area but we have to be careful in how we document it.
