SUMMary/ENH probability plots - design and open issues

We have an open PR #1433 to improve the gofplots probability plots. I find it difficult to figure out what the various fit, standardization, and axis label (which scaling) options are.

Some notes on it:

1. Probplots are generic and should work for all distributions.
1. As in GOF tests, the null hypothesis can be either fully specified or a family of distributions (e.g. is it N(0, 1), or is it normally distributed for any loc and scale, or, in between, is it normal with mean=0 and arbitrary scale).
1. The location/scale family with, for example, standardized data (x - mean) / std is only a special case, but the most common usage.
1. Probability plots can be used as an “estimator”, e.g. in the normal case: which line fits the bulk of the observations in the qq space.
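
To make the three kinds of null hypothesis in the notes above concrete, here is a minimal sketch using only the Python standard library. The helper `qq_pairs`, the plotting positions, and the simulated sample are illustrative assumptions, not the gofplots implementation.

```python
import math
import random
from statistics import NormalDist, fmean


def qq_pairs(data, dist, a=0.5):
    """Return (theoretical quantile, sample quantile) pairs.

    `dist` must be completely specified. Plotting positions are
    (i - a) / (n + 1 - 2a); a=0.5 is just an illustrative default.
    """
    xs = sorted(data)
    n = len(xs)
    probs = [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]
    return [(dist.inv_cdf(p), x) for p, x in zip(probs, xs)]


random.seed(0)
sample = [random.gauss(3.0, 2.0) for _ in range(200)]

# Case 1: fully specified null, N(0, 1).
h0_fixed = NormalDist(0.0, 1.0)
# Case 2: in between, mean fixed at 0 and only the scale estimated.
h0_scale = NormalDist(0.0, math.sqrt(fmean([x * x for x in sample])))
# Case 3: normal for any loc and scale, both estimated from the data.
h0_fit = NormalDist.from_samples(sample)


def mean_abs_gap(pairs):
    """Average vertical distance of the points from the 45 degree line."""
    return fmean([abs(t - s) for t, s in pairs])


gaps = {
    "fixed": mean_abs_gap(qq_pairs(sample, h0_fixed)),
    "scale_only": mean_abs_gap(qq_pairs(sample, h0_scale)),
    "fit": mean_abs_gap(qq_pairs(sample, h0_fit)),
}
```

The same data look fine against the fitted normal and far off against N(0, 1), which is why the plot needs to know which hypothesis it is displaying.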

For display, one of the relevant choices is the labels/units of the axes:

1. pp-plot: straightforward, always in [0,1]
1. qq-plot: either original units of raw data or standardized units for loc=0, scale=1 (which is not necessarily the same as standardizing by sample mean and sample variance). It looks like in 0.6 we have standardized axis labels; in PR #1433 we get original scale axis labels. I think the latter is more common and contains more information.
1. related adjustment of plots through plotting positions.
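
A small sketch of the axis choices listed above, again with only the standard library. The plotting-position formula (i - alpha) / (n + 1 - alpha - beta) is the common two-parameter family; the function name, the default values, the toy data, and the null N(12, 2) are assumptions for illustration.

```python
from statistics import NormalDist


def plotting_positions(n, alpha=0.4, beta=0.4):
    """Probabilities assigned to the n order statistics."""
    return [(i - alpha) / (n + 1 - alpha - beta) for i in range(1, n + 1)]


data = sorted([12.1, 9.4, 15.3, 10.8, 13.7])
null = NormalDist(12.0, 2.0)  # hypothetical fully specified null
probs = plotting_positions(len(data))

# pp-plot: both coordinates are probabilities, so always in [0, 1].
pp = [(p, null.cdf(x)) for p, x in zip(probs, data)]

# qq-plot, theoretical axis in standardized units (loc=0, scale=1) ...
q_std = [NormalDist().inv_cdf(p) for p in probs]
# ... or in the original units of the data.
q_raw = [null.inv_cdf(p) for p in probs]
```

For a loc-scale family the two qq axes differ only by the affine map `q_raw = loc + scale * q_std`, so the choice changes the tick labels but not the shape of the point cloud.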

adding lines and supporting elements

Options for lines: either a fully specified location, or an estimated location of the line. We don’t necessarily want the line based on MLE parameters if the data is contaminated with outliers or for other reasons, i.e. we don’t always want to assume that the null hypothesis holds for the line.
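
As an illustration of why the choice of line matters, the sketch below compares a least-squares line with a line through the first and third quartiles (the approach R's `qqline` takes by default) on a sample with 5% outliers. The function names and the simulated data are assumptions, not part of any PR.

```python
import random
from statistics import NormalDist, fmean, quantiles


def quartile_line(theoretical, sample):
    """Slope and intercept of the line through the two quartile points."""
    tq1, _, tq3 = quantiles(theoretical, n=4)
    sq1, _, sq3 = quantiles(sample, n=4)
    slope = (sq3 - sq1) / (tq3 - tq1)
    return slope, sq1 - slope * tq1


def ols_line(theoretical, sample):
    """Ordinary least-squares slope and intercept."""
    mt, ms = fmean(theoretical), fmean(sample)
    sxy = sum((t - mt) * (s - ms) for t, s in zip(theoretical, sample))
    sxx = sum((t - mt) ** 2 for t in theoretical)
    slope = sxy / sxx
    return slope, ms - slope * mt


random.seed(1)
clean = [random.gauss(10.0, 2.0) for _ in range(190)]
outliers = [random.gauss(40.0, 1.0) for _ in range(10)]
sample = sorted(clean + outliers)
n = len(sample)
theo = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

q_slope, q_icpt = quartile_line(theo, sample)  # tracks the bulk, scale near 2
o_slope, o_icpt = ols_line(theo, sample)       # pulled up by the outliers
```

The quartile line stays close to the scale of the uncontaminated bulk, while the least-squares slope is inflated by the ten outlying points.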

Compared to R: qqnorm plots a standardized x-axis (‘Theoretical Quantiles’), qqplot uses raw scale for the x-axis (I imported distr and distrMod); both plot raw scale on the y-axis (‘Sample Quantiles’). qqplot seems to take only a “frozen”, completely specified distribution (as far as I have figured out; the docstring is not very clear).

distrMod has pointwise and simultaneous confidence bands (“exact” and asymptotic) for qqplot, e.g. b = qqbounds(x3, Norm(15, sqrt(30)), alpha=0.95, n=30, withConf.pw = TRUE, withConf.sim = TRUE, exact.sCI = FALSE, exact.pCI = FALSE, nosym.pCI = FALSE). I assume it relies on a fully a-priori specified distribution with no estimated parameters. Simultaneous confidence bands are based on ks_test with known parameters…
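
One way to get a simultaneous band under the same assumption (fully specified distribution, no estimated parameters) is sketched below. It uses the Dvoretzky-Kiefer-Wolfowitz bound for the half-width, which matches the leading term of the asymptotic Kolmogorov critical value. This is not distrMod's implementation; the function names are hypothetical, and `alpha` here is the error level (0.05), not the coverage level used in the R call above.

```python
import math
import random
from statistics import NormalDist


def dkw_halfwidth(n, alpha=0.05):
    """d such that sup |F_n - F| <= d with probability at least 1 - alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def simultaneous_band(n, dist, alpha=0.05):
    """Bounds for each order statistic x_(i), valid jointly for all i.

    From |F_n(x) - F(x)| <= d for all x it follows that
    i/n - d <= F(x_(i)) <= (i - 1)/n + d.
    """
    d = dkw_halfwidth(n, alpha)
    band = []
    for i in range(1, n + 1):
        lo_p = i / n - d
        hi_p = (i - 1) / n + d
        lower = dist.inv_cdf(lo_p) if lo_p > 0 else -math.inf
        upper = dist.inv_cdf(hi_p) if hi_p < 1 else math.inf
        band.append((lower, upper))
    return band


random.seed(2)
n = 30
null = NormalDist(15.0, math.sqrt(30.0))  # same null as in the R example
sample = sorted(random.gauss(15.0, math.sqrt(30.0)) for _ in range(n))
band = simultaneous_band(n, null)
n_outside = sum(not (lo <= x <= hi) for x, (lo, hi) in zip(sample, band))
```

Once loc and scale are estimated from the same data, this band is no longer valid as is, which is the reason for wanting to keep track of what was fitted.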

(to be continued)

background

#4 merged PR, looks like the early qq-plot with discussion on adding a line.
#412 merged PR with discussion about previous refactoring and enhancements by Paul Hobson

current issues

#1433 PR
#1299 initial bug issue

https://github.com/phobson/statsmodels/pull/1 other changes, with some discussion, obsolete closed #1407 #1414 disabled unit tests

Other Issues and Enhancements

#1297 k-sample probability plot, in analogy to k-sample gof test
#1307 chisquare probability plot for multivariate normal, extension to general elliptical distribution would require the level set distributions instead of chisquare
#1119 confidence intervals and other extensions, see “compared to R” above
special cases, e.g. normal plot that can use distribution specific properties - no issue ?

Issue Analytics

Created 9 years ago
Comments:5 (5 by maintainers)

Top GitHub Comments

josef-pkt commented, Mar 1, 2017

@phobson I will look at it. I was trying out some examples in a notebook, and several cases didn’t work and raised an exception. Before thinking of the scale in the plot, I need to figure out what the different cases are. My thinking in this issue was that we want to completely separate the gof statistics computation from the display and probscaling part.

I was thinking of considering the following cases with relatively arbitrary combination, but don’t know yet how it will match up with the options and keywords. I don’t want to give up fitting or fixing loc and scale inside the class, because it will affect whether we have some confidence intervals available (e.g. kstest for simultaneous confidence intervals assumes either fully specified distribution or a loc-scale family, and will not be available otherwise without bootstrap.)

The statistically relevant part is the assumptions; the rest is mostly a display issue, i.e. what to put on the axis.

Cases

by assumption
fully specified
loc and scale family, fixed shape
general family, data dependent distribution parameters, fit

by data
probabilities, cdf
quantiles

by plot axis labels
ppplot
qqplot
general axis scaling and labels

phobson commented, Mar 1, 2017

@josef-pkt

That all makes sense, especially this part:

> My thinking in this issue was that we want to completely separate the gof statistics computation from the display and probscaling part.

I’m not sure what the computation of the confidence intervals would look like without bootstrapping (which is what mpl-probscale does).

Not that you asked for it, but my advice would be to rein in the scope. Specifically, I’m talking about fitting vs frozen vs generic distributions. In mpl-probscale, you specify:

the plottype (‘pp’, ‘qq’, or ‘prob’)
which axis shows the data
which axis shows the probabilities or quantiles (‘x’ or ‘y’, called probax)
What label goes on the probax
What label goes on the other ax
your distribution object with a ppf and a cdf method
if you want a line fit (I don’t return stats)
if you want a 95% conf interval band (again, just a visual option)

If the distribution needs to be fit, the user is responsible for that, and I’ve shown them how to do it in the docs. I would be interested in incorporating these kstest conf. intervals into mpl-probscale.
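
The division of labour described here (the user fits, the plotting code only needs an object with `ppf` and `cdf`) might look roughly like the following. This is a standard-library sketch, not the mpl-probscale API: `FrozenNormal`, `fit_normal`, and `plot_pairs` are hypothetical names, and the ‘prob’ branch assumes that plot type means percentiles destined for a probability-scaled axis.

```python
from statistics import NormalDist


class FrozenNormal:
    """Minimal duck-typed distribution: only `cdf` and `ppf` are required."""

    def __init__(self, loc, scale):
        self._dist = NormalDist(loc, scale)

    def cdf(self, x):
        return self._dist.cdf(x)

    def ppf(self, p):
        return self._dist.inv_cdf(p)


def fit_normal(data):
    """Fitting is the caller's job; the plotting code never estimates."""
    est = NormalDist.from_samples(data)
    return FrozenNormal(est.mean, est.stdev)


def plot_pairs(data, dist, plottype="qq", a=0.5):
    """Coordinates only: (probability-axis value, data-axis value)."""
    xs = sorted(data)
    n = len(xs)
    probs = [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]
    if plottype == "pp":
        return [(p, dist.cdf(x)) for p, x in zip(probs, xs)]
    if plottype == "qq":
        return [(dist.ppf(p), x) for p, x in zip(probs, xs)]
    if plottype == "prob":
        return [(100.0 * p, x) for p, x in zip(probs, xs)]
    raise ValueError(f"unknown plottype: {plottype!r}")


data = [1.0, 2.0, 3.0, 4.0, 5.0]
dist = fit_normal(data)  # done by the user, outside the plot call
qq = plot_pairs(data, dist, "qq")
pp = plot_pairs(data, dist, "pp")
```

Keeping the fit outside keeps the plotting function small, at the cost that it cannot know whether parameters were estimated, which matters for any confidence band it draws.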

A quick search around the web doesn’t make anything obvious to me (but few things are).

Addendum A: Looks like I do let the users get the best-fit results back. You can also pass alpha and beta parameters to tweak the plotting positions.

Addendum B: I also have a parameter to pass options to the CI-estimator, though it is currently not used. (Oops)
