
ENH: stats: more analytical formulas for fitting distributions to data


In the CZI Proposal, we wrote:

The continuous distributions in SciPy all have a method for fitting the distribution to data using the method of maximum likelihood estimation (MLE). The proposed enhancements are:

  • Where possible, use an analytical formula rather than numerical optimization for greater speed and accuracy …

This is part of our effort to address the roadmap item “Improve the options for fitting a probability distribution to data”.

One of the best sources I’ve found (via the NIST Engineering Statistics Handbook) is the Parameter Estimation sections of Evans, Hastings, and Peacock (2000), Statistical Distributions, 3rd ed., John Wiley and Sons.

Any other sources that are easily accessible?

Also, for method of moments (gh-11695), I could write a program that tries to solve for fitting formulas (from the formulas for the moments) using SymPy. I don’t think that would work as well for MLE, though. Might not even get anything new for MM.
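As a quick proof of concept of the SymPy idea for method of moments (this is my own minimal sketch, not an actual script from this thread), one can express a distribution's moment symbolically and solve the moment equation for a parameter. Here, the Rayleigh scale is recovered by matching the first moment to a symbolic sample mean `xbar`:

```python
import sympy as sp
from sympy.stats import Rayleigh, E

# sigma: the Rayleigh scale parameter; xbar: a symbol standing for the sample mean
sigma, xbar = sp.symbols('sigma xbar', positive=True)
X = Rayleigh('X', sigma)

# Method of moments: set E[X] equal to the sample mean and solve for sigma
sol = sp.solve(sp.Eq(E(X), xbar), sigma)
```

This reproduces the known closed form sigma = xbar * sqrt(2/pi); for distributions with more parameters, one would solve a system of moment equations the same way, which is exactly where SymPy may or may not succeed.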

Given the formulas, this should be pretty straightforward work: just override the fit methods of the distributions defined in /scipy/stats/_continuous_distns.py, following existing examples (e.g. beta_gen) for extending documentation and any other conventions. For tests, I suppose we could mainly compare against the generic implementation. In cases where the generic implementation isn’t working very well, we could check the partial derivatives of the likelihood function?
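To make the testing idea concrete, here is a hedged sketch (my own illustration, not an existing SciPy test) of checking that an analytical MLE zeroes the partial derivative of the log-likelihood, using the Rayleigh scale with loc fixed at 0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.rayleigh(scale=2.0, size=500)

# Analytical MLE for the Rayleigh scale with loc fixed at 0:
# sigma^2 = sum(x_i^2) / (2n)
sigma_hat = np.sqrt(np.mean(data**2) / 2)

def negloglike(sigma):
    """Negative log-likelihood of the data under Rayleigh(loc=0, scale=sigma)."""
    return -np.sum(stats.rayleigh.logpdf(data, loc=0, scale=sigma))

# The score (derivative of the log-likelihood) should vanish at the MLE;
# check it numerically with a central difference.
h = 1e-6
score = (negloglike(sigma_hat + h) - negloglike(sigma_hat - h)) / (2 * h)
```

The same pattern generalizes: for a multi-parameter fit, one would check each partial derivative at the candidate optimum, which is a useful fallback test when the generic numerical fit is itself unreliable.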

If you have any thoughts / words of wisdom before we get started, please let us know here.

@fletcheaston @swallan I think this is next.

Update: @swallan @WarrenWeckesser A simple but really useful script for generating first-order conditions is here. Now it can (sometimes) solve for a variable in terms of the rest.

| Continuous distribution | Current status of fit override | Page number of MLE (Evans) |
| --- | --- | --- |
| laplace | merged: gh-11988, good | 124 |
| rayleigh | merged: gh-12097, pending: gh-17090 | 175 |
| logistic | merged: gh-12738, pending: gh-17117 | 130 |
| pareto | merged: gh-15567, good | 149 |
| invgauss, wald | merged: gh-12514, further investigation needed | 121 |
| gumbel_r, gumbel_l | merged: gh-12737, good | 101 |
| powerlaw | merged: gh-13053, good | 159 |
| nakagami | see gh-10908 | - |
| vonmises | draft: gh-13435 | 192 |
| weibull_min, weibull_max | see gh-11806 | 196 |
| asymmetric laplace | awaiting PR | notes |
| genextreme | see gh-10446 | - |

Other distributions that already have an overridden fit method:

  • norm - looks good, nothing to do here
  • beta - investigate fitting loc and scale? Adjust inequality conditions for FitDataError.
  • expon - works perfectly, AFAICT
  • weibull_min - should implement analytical solution when loc is fixed
  • gamma - look into fitting when loc is not fixed. Adjust inequality conditions for FitDataError.
  • lognorm - see gh-16839
  • uniform - works perfectly, AFAICT

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 33 (31 by maintainers)

Top GitHub Comments

4 reactions
huard commented, May 29, 2020

Don’t know if this is in scope, but L-moments parameter estimates are often more robust to outliers, especially for extreme value distributions. There is an existing implementation in the lmoments3 package, but it does not seem to be maintained anymore. Having this functionality in scipy would be great, and a lot of people in the hydrological sciences would be grateful.

Note that Fortran code is available, and googling turns up a couple of wrappers.
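For context, sample L-moments are linear combinations of order statistics. A minimal sketch of the first two (via the standard unbiased probability-weighted-moment estimators; this is a textbook formula, not code from lmoments3):

```python
import numpy as np

def sample_l_moments(x):
    """First two sample L-moments via probability-weighted moments.

    l1 (L-location) equals the sample mean; l2 (L-scale) is an
    outlier-robust analogue of the scale, built from order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    b0 = x.mean()
    # Unbiased estimate of b1 = E[X * F(X)] from the ordered sample
    b1 = np.sum(np.arange(n) * x) / (n * (n - 1))
    l1 = b0
    l2 = 2 * b1 - b0
    return l1, l2
```

Fitting by the method of L-moments then means equating these sample quantities to the distribution's theoretical L-moments and solving for the parameters, analogously to the ordinary method of moments.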

1 reaction
mdhaber commented, Jan 17, 2021

The table above lists the ones we had planned based on what we had found in this reference.

That said, as we went through, we found that because SciPy’s parameterization always includes scale and loc, we frequently had to derive the conditions ourselves; the reference mostly gave us hints about which distributions were worth trying. I wrote a little script that uses SymPy to help with that, if you’re interested. It derives the likelihood equations and solves them when it can. It’s not perfect: it often fails to find a solution even when one exists, it usually writes things in a form that’s inconvenient to work with, and it doesn’t tell you what to do when the solution to a likelihood equation isn’t the MLE (e.g. when there is no solution, when it is a minimum rather than a maximum, or when there are other constraints to consider). It is mostly useful for quickly seeing how complicated the likelihood equations are, so you can assess whether there might be a closed-form solution. It’s nice when you can get a closed-form solution for at least one parameter; if not, it’s better to fall back on the generic optimization rather than solve the likelihood equations numerically, since the likelihood equations don’t account for constraints (e.g. that the data must lie within the support of the distribution).

@swallan had you started on a PR for von Mises? If not, that would be a good one for @jjerphan. There are some notes about it above.

@lispsil said he was interested in Weibull. (gh-12787). exponweibull is another one I’ve seen an issue about (gh-11806).

Otherwise, any distribution is fair game if it hasn’t already been done, I think. Even ones that have been done might not be implemented for all combinations of fixed and free parameters.

I wrote this in gh-12787, but I should copy it here:

I think gh-12737 is a good example of how to override fit. It shows how different cases of user input (e.g. loc is fixed while the others are free) may need to be treated separately, including falling back on the generic super fit method when no formula is available. We have (over the course of a few PRs) come up with a nice _check_fit_input_parameters method that should help. It also shows how the ._fitstart method is used to get initial guesses of the parameters (you can override that too, if you need to) and how you can use _assert_less_or_close_loglike in your tests to ensure that the analytical fit is no worse than the superclass fit. Finally, you can test the performance of your new fit override by adding it to the ContinuousFitAnalyticalMLEOverride benchmark.
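The structural pattern described above (analytical formulas in the unconstrained case, falling back on the generic super fit otherwise) looks roughly like this. This is a hedged, self-contained sketch using the Laplace formulas, not SciPy's actual source, and it omits the real `_check_fit_input_parameters` helper:

```python
import numpy as np
from scipy import stats


class _LaplaceSketch(stats.rv_continuous):
    """Illustration of the fit-override pattern, not SciPy's implementation."""

    def _pdf(self, x):
        # Standard Laplace density
        return 0.5 * np.exp(-np.abs(x))

    def fit(self, data, *args, **kwds):
        # Fall back on the generic numerical fit when a parameter is fixed
        # or a non-MLE method is requested; no analytical formula applies then.
        if (kwds.get('method', 'mle').lower() != 'mle'
                or 'floc' in kwds or 'fscale' in kwds):
            return super().fit(data, *args, **kwds)
        data = np.asarray(data)
        loc = np.median(data)                 # analytical MLE for loc
        scale = np.mean(np.abs(data - loc))   # analytical MLE for scale
        return loc, scale


laplace_sketch = _LaplaceSketch(name='laplace_sketch')
```

A real override has more cases to handle (one parameter fixed, the other free; guesses from ._fitstart when falling back), but the dispatch shape is the same.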
