ENH: stats: more analytical formulas for fitting distributions to data
In the CZI Proposal, we wrote:
The continuous distributions in SciPy all have a method for fitting the distribution to data using the method of maximum likelihood estimation (MLE). The proposed enhancements are:
- Where possible, use an analytical formula rather than numerical optimization for greater speed and accuracy …
This is part of our effort to address the roadmap item “Improve the options for fitting a probability distribution to data”.
One of the best sources I’ve found (via the NIST Engineering Statistics Handbook) is the Parameter Estimation sections of Evans, Hastings, and Peacock (2000), Statistical Distributions, 3rd ed., John Wiley and Sons.
Any other sources that are easily accessible?
Also, for the method of moments (gh-11695), I could write a program that tries to solve for fitting formulas (from the formulas for the moments) using SymPy. I don’t think that would work as well for MLE, though, and it might not even turn up anything new for MM.
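To sketch what such a SymPy program might do (this is illustrative, not the actual script), here is the method-of-moments system for the gamma distribution, with `m1` and `m2` standing in for the sample mean and variance:

```python
import sympy as sp

# Shape `a` and `scale` of the gamma distribution; m1, m2 are the
# sample mean and variance (placeholders for the data-derived values).
a, scale, m1, m2 = sp.symbols('a scale m1 m2', positive=True)

# Gamma distribution: E[X] = a*scale, Var[X] = a*scale**2.
eqs = [sp.Eq(a * scale, m1), sp.Eq(a * scale**2, m2)]
sol = sp.solve(eqs, [a, scale], dict=True)
# solve recovers the familiar MM estimators: a = m1**2/m2, scale = m2/m1
```

When `solve` succeeds like this, the result can be transcribed directly into a `fit` override; when it returns nothing, that is already useful evidence that no simple closed form exists.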
Given the formulas, this should be pretty straightforward work: just override the `fit` methods of the distributions defined in `/scipy/stats/_continuous_distns.py`, following existing examples (e.g. `beta_gen`) for extending documentation and any other conventions. For tests, I suppose we could mainly compare against the generic implementation. In cases where the generic implementation isn’t working very well, we could check the partial derivatives of the likelihood function?
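For the kind of test described above, a comparison against the closed-form formulas can be quite direct. A minimal sketch for `laplace`, whose MLE is the sample median for `loc` and the mean absolute deviation about it for `scale`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1234)
data = rng.laplace(loc=2.0, scale=3.0, size=10_000)

# Closed-form MLE for the Laplace distribution.
loc_mle = np.median(data)
scale_mle = np.mean(np.abs(data - loc_mle))

# The overridden fit (gh-11988) should reproduce these exactly.
fit_loc, fit_scale = stats.laplace.fit(data)
assert np.allclose((fit_loc, fit_scale), (loc_mle, scale_mle))
```

A generic-vs-analytical comparison would look the same, except the reference values would come from calling the superclass `fit` instead of hand-computed formulas.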
If you have any thoughts / words of wisdom before we get started, please let us know here.
@fletcheaston @swallan I think this is next.
Update: @swallan @WarrenWeckesser A simple but really useful script for generating first-order conditions is here. Now it can (sometimes) solve for a variable in terms of the rest.
| Continuous distribution | Current status of `fit` override | Page of MLE formula in Evans |
|---|---|---|
| laplace | merged: gh-11988, good | 124 |
| rayleigh | merged: gh-12097, pending: gh-17090 | 175 |
| logistic | merged: gh-12738, pending: gh-17117 | 130 |
| pareto | merged: gh-15567, good | 149 |
| invgauss, wald | merged: gh-12514, further investigation needed | 121 |
| gumbel_r, gumbel_l | merged: gh-12737, good | 101 |
| powerlaw | merged: gh-13053, good | 159 |
| nakagami | see gh-10908 | - |
| vonmises | draft: gh-13435 | 192 |
| weibull_min, weibull_max | see gh-11806 | 196 |
| asymmetric laplace | awaiting PR | notes |
| genextreme | see gh-10446 | - |
Other distributions that have an overridden `fit` method:

- `norm` - looks good, nothing to do here
- `beta` - investigate fitting `loc` and `scale`? Adjust inequality conditions for `FitDataError`.
- `expon` - works perfectly, AFAICT
- `weibull_min` - should implement analytical solution when `loc` is fixed
- `gamma` - look into fitting when `loc` is not fixed. Adjust inequality conditions for `FitDataError`.
- `lognorm` - see gh-16839
- `uniform` - works perfectly, AFAICT
Don’t know if this is in scope, but L-moment parameter estimates are often more robust to outliers, especially for extreme value distributions. There is an existing implementation in the `lmoments3` package, but it does not seem to be maintained anymore. Having this functionality in SciPy would be great, and a lot of people in the hydrological sciences would be grateful. Note that Fortran code is available, and googling turns up a couple of wrappers.
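To make the suggestion concrete, the first two sample L-moments can be computed from probability-weighted moments in a few lines. This is a generic sketch of the standard estimator (not the `lmoments3` API):

```python
import numpy as np

def sample_l_moments(x):
    """First two sample L-moments (l1, l2) via probability-weighted moments."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    b0 = x.mean()                                    # PWM beta_0 (the sample mean)
    b1 = np.sum(np.arange(n) * x) / (n * (n - 1))    # PWM beta_1
    return b0, 2.0 * b1 - b0                         # l1 = b0, l2 = 2*b1 - b0

l1, l2 = sample_l_moments([1.0, 2.0, 3.0, 4.0])
# l1 is the sample mean (2.5); l2 is half the mean pairwise absolute
# difference (5/6 for this data).
```

Equating sample L-moments like these to their distribution-specific expressions gives the method-of-L-moments estimates, analogous to ordinary method of moments.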
The table above lists the ones we had planned, based on what we had found in this reference.

That said, as we went through, we found that since SciPy’s parameterization always has `scale` and `loc`, we frequently have to derive the conditions ourselves, and the reference just gave us hints at which distributions are worth trying. I wrote a little script that uses SymPy to help a bit with that, if you’re interested. It derives the likelihood equations and solves them if it can. It’s not perfect: it often fails to find the solution even when one exists, it usually writes things in a form that’s inconvenient to work with, and it doesn’t tell you what to do when the solution of a likelihood equation isn’t the MLE (e.g. when there is no solution, when it is a minimum rather than a maximum, when there are other constraints to consider, etc.). It is mostly useful for quickly seeing how complicated the likelihood equations are, so you can assess whether there might be a closed-form solution. It’s nice when you can get a closed-form solution for at least one parameter; if not, it’s better to just use the generic optimization rather than trying to solve the likelihood equations numerically, as the likelihood equations don’t consider constraints (e.g. that the data must lie within the support of the distribution).

@swallan had you started on a PR for von Mises? If not, that would be a good one for @jjerphan. There are some notes about it above.
@lispsil said he was interested in Weibull (gh-12787).
`exponweibull` is another one I’ve seen an issue about (gh-11806). Otherwise, any distribution is fair game if it hasn’t already been done, I think. Even ones that have been done might not be implemented for all combinations of fixed and free parameters.
I wrote in gh-12787, but should copy here: