SUMMary/ENH probability plots - design and open issues
We have an open PR #1433 to improve the gofplots probability plots. I find it difficult to figure out what the various fit, standardization and axis label (which scaling) options are.
some notes on it:
- Probplots are general/generic and should work for all distributions
- As in GOF tests, the null hypothesis can be either fully specified or cover a family of distributions (e.g. is it N(0, 1), or is it normally distributed for any loc and scale, or, in between, is it normal with mean=0 and arbitrary scale?)
- The location/scale distributions with, for example, standardized data (x - mean) / std are only a special case, but the most common usage.
- probability plots can be used as “estimator”, e.g. normal case - which line fits the bulk of the observations in the qq space.
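The three kinds of null hypothesis listed above can be sketched with scipy.stats frozen distributions. This is only an illustration under assumptions: the simulated data, the variable names and the use of `norm.fit` with `floc` are not taken from PR #1433.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=2.0, size=200)  # illustrative data only

# Case 1: fully specified null, N(0, 1); nothing is estimated.
dist_full = stats.norm(loc=0.0, scale=1.0)

# Case 2: in between, mean fixed at 0 and scale estimated from the data.
loc_fixed, scale_est = stats.norm.fit(x, floc=0.0)
dist_fixed_loc = stats.norm(loc=loc_fixed, scale=scale_est)

# Case 3: loc-scale family, both parameters estimated (normal MLE).
loc_est, scale_both = stats.norm.fit(x)
dist_family = stats.norm(loc=loc_est, scale=scale_both)

# Standardizing by sample mean and std (ddof=0) is the special case that
# corresponds to case 3 for the normal distribution.
z = (x - x.mean()) / x.std()
```

Each frozen distribution provides `cdf` and `ppf`, so the same plotting code could be used for all three cases; only the assumptions behind it differ.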
for display, one of the relevant choices is the labels/units of the axes
- pp-plot: straightforward, always in [0,1]
- qq-plot: either original units of raw data or standardized units for loc=0, scale=1 (which is not necessarily the same as standardizing by sample mean and sample variance). It looks like in 0.6 we have standardized axis labels, in PR #1433 we get original scale axis labels. I think the latter is more common and contains more information.
- related adjustment of plots through plotting positions.
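A small sketch of the axis-unit choices for a normal qq-plot, using only numpy and scipy. The hypothesized values loc0 and scale0 and the plotting positions (i - 0.5) / n are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = np.sort(rng.normal(loc=10.0, scale=3.0, size=50))  # illustrative data
n = len(x)
probs = (np.arange(1, n + 1) - 0.5) / n  # one simple choice of plotting positions

# Standardized units: quantiles of the reference distribution with loc=0, scale=1.
q_std = stats.norm.ppf(probs)

# Original units: quantiles of a hypothesized N(loc0, scale0).
loc0, scale0 = 10.0, 3.0
q_orig = stats.norm.ppf(probs, loc=loc0, scale=scale0)

# For a loc-scale family the two axes differ only by an affine relabeling.
relabeled = loc0 + scale0 * q_std

# Standardizing the data by sample moments is a different operation from
# standardizing by the hypothesized loc and scale.
z_sample = (x - x.mean()) / x.std(ddof=1)
z_hyp = (x - loc0) / scale0

# pp-plot coordinates always lie in [0, 1].
pp = stats.norm.cdf(x, loc=loc0, scale=scale0)
```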
adding lines and supporting elements
- options for lines - either fully specified location, or estimated location of line. We don’t necessarily want the line based on MLE parameters if the data is contaminated with outliers or for other reasons, i.e. we don’t always want to assume that the null hypothesis holds for the line.
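To illustrate why the line choice matters, here is a sketch comparing three estimated lines on contaminated data: MLE parameters, a least-squares fit through the qq points, and a line through the quartiles (similar in spirit to R's qqline). The data and variable names are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=100)
x[:5] = 15.0  # contaminate the sample with a few outliers

x_sorted = np.sort(x)
n = len(x_sorted)
probs = (np.arange(1, n + 1) - 0.5) / n
q_theo = stats.norm.ppf(probs)  # standardized theoretical quantiles

# Line from MLE parameters: sample = loc + scale * theoretical.
loc_mle, scale_mle = stats.norm.fit(x)

# Least-squares line through the points in qq space.
slope_ls, intercept_ls = np.polyfit(q_theo, x_sorted, 1)

# Robust line through the first and third quartiles.
q25, q75 = np.percentile(x, [25, 75])
t25, t75 = stats.norm.ppf([0.25, 0.75])
slope_q = (q75 - q25) / (t75 - t25)
intercept_q = q25 - slope_q * t25
```

With the outliers included, the MLE scale is inflated, while the quartile line still follows the bulk of the observations.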
compared to R
qqnorm plots standardized x-axis (‘Theoretical Quantiles’), qqplot uses raw scale for x-axis (I imported distr and distrMod), both plot raw scale on y-axis (‘Sample Quantiles’)
qqplot seems to take only “frozen”, completely specified distribution (as far as I have figured out, docstring is not very clear)
distrMod has pointwise and simultaneous confidence bands (“exact” and asymptotic) for qqplot, e.g.
b = qqbounds(x3, Norm(15, sqrt(30)), alpha=0.95,n=30, withConf.pw = TRUE, withConf.sim = TRUE, exact.sCI=FALSE ,exact.pCI= FALSE, nosym.pCI = FALSE)
I assume it uses the assumption of no estimated parameters and fully a-priori specified distribution. Simultaneous confidence bands are based on ks_test with known parameters…
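A rough Python sketch of an asymptotic simultaneous band derived from the Kolmogorov-Smirnov distribution, assuming a fully specified distribution with no estimated parameters. The function name `ks_qq_band` is hypothetical, and this is not a port of distrMod's qqbounds.

```python
import numpy as np
from scipy import stats


def ks_qq_band(n, dist, conf_level=0.95):
    """Asymptotic KS-based simultaneous band in quantile units (sketch)."""
    probs = (np.arange(1, n + 1) - 0.5) / n
    # Asymptotic critical value for sup |F_n - F| with known parameters.
    d_crit = stats.kstwobign.ppf(conf_level) / np.sqrt(n)
    # Shift the probabilities by the critical value and map back to quantiles.
    p_low = np.clip(probs - d_crit, 0.0, 1.0)
    p_upp = np.clip(probs + d_crit, 0.0, 1.0)
    return dist.ppf(probs), dist.ppf(p_low), dist.ppf(p_upp)


q, lower, upper = ks_qq_band(30, stats.norm(15, np.sqrt(30)))
```

The band is infinite in the tails wherever the shifted probability reaches 0 or 1, which is typical for KS-based bands.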
(to be continued)
background
- #4 merge PR looks like the early qq-plot with discussion on adding line.
- #412 merged PR with discussion about previous refactoring and enhancements by Paul Hobson
current issues
- #1433 PR
- #1299 initial bug issue
- https://github.com/phobson/statsmodels/pull/1 other changes, with some discussion, obsolete
- closed #1407 #1414 disabled unit tests
Other Issues and Enhancements
- #1297 k-sample probability plot, in analogy to k-sample gof test
- #1307 chisquare probability plot for multivariate normal, extension to general elliptical distribution would require the level set distributions instead of chisquare
- #1119 confidence intervals and other extensions, see “compared to R” above
- special cases, e.g. normal plot that can use distribution specific properties - no issue ?
@phobson I will look at it. I was trying out some examples in a notebook, and several cases didn’t work and raised an exception. Before thinking of the scale in the plot, I need to figure out what the different cases are. My thinking in this issue was that we want to completely separate the gof statistics computation from the display and probscaling part.
I was thinking of considering the following cases with relatively arbitrary combination, but don’t know yet how it will match up with the options and keywords. I don’t want to give up fitting or fixing loc and scale inside the class, because it will affect whether we have some confidence intervals available (e.g. kstest for simultaneous confidence intervals assumes either fully specified distribution or a loc-scale family, and will not be available otherwise without bootstrap.)
The statistically relevant part is the assumptions; the rest is mostly a display issue, i.e. what to put on the axis.
Cases
by assumption
- fully specified
- loc and scale family, fixed shape
- general family, data dependent distribution parameters, fit
by data
- probabilities, cdf
- quantiles
by plot axis labels
- ppplot
- qqplot
- general axis scaling and labels
@josef-pkt
That all makes sense, especially this part:
I’m not sure what the computation of the confidence intervals would look like without bootstrapping (which is what mpl-probscale does).
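For reference, a bootstrap band around a best-fit line in qq space could look roughly like the sketch below. It only illustrates the general idea; the function name `bootstrap_line_band` and its arguments are hypothetical and are not mpl-probscale's actual implementation.

```python
import numpy as np
from scipy import stats


def bootstrap_line_band(x, dist=stats.norm, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap band for a least-squares line in qq space (sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    probs = (np.arange(1, n + 1) - 0.5) / n
    q_theo = dist.ppf(probs)
    lines = np.empty((n_boot, n))
    for b in range(n_boot):
        # Resample with replacement, sort, and refit the line.
        xb = np.sort(rng.choice(x, size=n, replace=True))
        slope, intercept = np.polyfit(q_theo, xb, 1)
        lines[b] = intercept + slope * q_theo
    tail = 100 * (1 - level) / 2
    lower, upper = np.percentile(lines, [tail, 100 - tail], axis=0)
    return q_theo, lower, upper


x = np.random.default_rng(3).normal(size=80)
q_theo, lower, upper = bootstrap_line_band(x)
```

Unlike a kstest-based band, this makes no use of a known reference distribution beyond the theoretical quantiles, at the cost of refitting the line n_boot times.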
Not that you asked for it, but my advice would be to rein in the scope. Specifically, I’m talking about fitting vs frozen vs generic distributions. In mpl-probscale, you specify:
- plottype (‘pp’, ‘qq’, or ‘prob’)
- probax
- a distribution object with a ppf and a cdf method
If the distribution needs to be fit, the user is responsible for that, and I’ve shown them how to do it in the docs. I would be interested in incorporating these kstest conf. intervals into mpl-probscale.
A quick search around the web doesn’t make anything obvious to me (but few things are).
Addendum A: Looks like I do let the users get the best-fit results back. You can also pass alpha and beta parameters to tweak the plotting positions.
Addendum B: I also have a parameter to pass options to the CI-estimator, though it is currently not used. (Oops)
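The alpha and beta parameters mentioned in Addendum A are usually understood as the two-parameter plotting position family (i - alpha) / (n + 1 - alpha - beta), which is also the form used by scipy.stats.mstats.plotting_positions. A minimal sketch, assuming that convention (the helper below is illustrative, not mpl-probscale's code):

```python
import numpy as np


def plotting_positions(n, alpha=0.4, beta=0.4):
    """Plotting positions (i - alpha) / (n + 1 - alpha - beta), i = 1..n."""
    i = np.arange(1, n + 1)
    return (i - alpha) / (n + 1 - alpha - beta)


weibull = plotting_positions(4, alpha=0.0, beta=0.0)  # i / (n + 1)
hazen = plotting_positions(4, alpha=0.5, beta=0.5)    # (i - 0.5) / n
```

Different alpha and beta values mostly change how the extreme order statistics are placed, so the choice matters most in the tails of the plot.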