
ENH: stats: more analytical formulas for fitting distributions to data


In the CZI Proposal, we wrote:

The continuous distributions in SciPy all have a method for fitting the distribution to data using the method of maximum likelihood estimation (MLE). The proposed enhancements are:

  • Where possible, use an analytical formula rather than numerical optimization for greater speed and accuracy …

This is part of our effort to address the roadmap item “Improve the options for fitting a probability distribution to data”.

One of the best sources I’ve found (via the NIST Engineering Statistics Handbook) is the Parameter Estimation sections of Evans, Hastings, and Peacock (2000), Statistical Distributions, 3rd ed., John Wiley and Sons.

Any other sources that are easily accessible?

Also, for method of moments (gh-11695), I could write a program that tries to solve for fitting formulas (from the formulas for the moments) using SymPy. I don’t think that would work as well for MLE, though. Might not even get anything new for MM.
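As a quick proof of concept of the SymPy idea for method of moments (this is my own minimal sketch, not an actual script from this thread), one can express a distribution's moment symbolically and solve the moment equation for a parameter. Here, the Rayleigh scale is recovered by matching the first moment to a symbolic sample mean `xbar`:

```python
import sympy as sp
from sympy.stats import Rayleigh, E

# sigma: the Rayleigh scale parameter; xbar: a symbol standing for the sample mean
sigma, xbar = sp.symbols('sigma xbar', positive=True)
X = Rayleigh('X', sigma)

# Method of moments: set E[X] equal to the sample mean and solve for sigma
sol = sp.solve(sp.Eq(E(X), xbar), sigma)
```

This reproduces the known closed form sigma = xbar * sqrt(2/pi); for distributions with more parameters, one would solve a system of moment equations the same way, which is exactly where SymPy may or may not succeed.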

Given the formulas, this should be pretty straightforward work: just override the fit methods of the distributions defined in /scipy/stats/_continuous_distns.py, following existing examples (e.g. beta_gen) for extending documentation and any other conventions. For tests, I suppose we could mainly compare against the generic implementation. In cases where the generic implementation isn’t working very well, we could check the partial derivatives of the likelihood function?
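To make the testing idea concrete, here is a hedged sketch (my own illustration, not an existing SciPy test) of checking that an analytical MLE zeroes the partial derivative of the log-likelihood, using the Rayleigh scale with loc fixed at 0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.rayleigh(scale=2.0, size=500)

# Analytical MLE for the Rayleigh scale with loc fixed at 0:
# sigma^2 = sum(x_i^2) / (2n)
sigma_hat = np.sqrt(np.mean(data**2) / 2)

def negloglike(sigma):
    """Negative log-likelihood of the data under Rayleigh(loc=0, scale=sigma)."""
    return -np.sum(stats.rayleigh.logpdf(data, loc=0, scale=sigma))

# The score (derivative of the log-likelihood) should vanish at the MLE;
# check it numerically with a central difference.
h = 1e-6
score = (negloglike(sigma_hat + h) - negloglike(sigma_hat - h)) / (2 * h)
```

The same pattern generalizes: for a multi-parameter fit, one would check each partial derivative at the candidate optimum, which is a useful fallback test when the generic numerical fit is itself unreliable.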

If you have any thoughts / words of wisdom before we get started, please let us know here.

@fletcheaston @swallan I think this is next.

Update: @swallan @WarrenWeckesser A simple but really useful script for generating first-order conditions is here. Now it can (sometimes) solve for a variable in terms of the rest.

| Continuous distribution | Current status of fit override | Page number of MLE (Evans) |
| --- | --- | --- |
| laplace | merged: gh-11988, good | 124 |
| rayleigh | merged: gh-12097, pending: gh-17090 | 175 |
| logistic | merged: gh-12738, pending: gh-17117 | 130 |
| pareto | merged: gh-15567, good | 149 |
| invgauss, wald | merged: gh-12514, further investigation needed | 121 |
| gumbel_r, gumbel_l | merged: gh-12737, good | 101 |
| powerlaw | merged: gh-13053, good | 159 |
| nakagami | see gh-10908 | - |
| vonmises | draft: gh-13435 | 192 |
| weibull_min, weibull_max | see gh-11806 | 196 |
| asymmetric laplace | awaiting PR | notes |
| genextreme | see gh-10446 | - |

Other distributions that already have an overridden fit method:

  • norm - looks good, nothing to do here
  • beta - investigate fitting loc and scale? Adjust inequality conditions for FitDataError.
  • expon - works perfectly, AFAICT
  • weibull_min - should implement analytical solution when loc is fixed
  • gamma - look into fitting when loc is not fixed. Adjust inequality conditions for FitDataError.
  • lognorm - see gh-16839
  • uniform - works perfectly, AFAICT

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 33 (31 by maintainers)

Top GitHub Comments

4 reactions
huard commented, May 29, 2020

Don’t know if this is in scope, but L-moments parameter estimates are often more robust to outliers, especially for extreme value distributions. There is an existing implementation in the lmoments3 package, but it does not seem to be maintained anymore. Having this functionality in scipy would be great, and a lot of people in the hydrological sciences would be grateful.

Note that Fortran code is available, and googling turns up a couple of wrappers.
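For context, sample L-moments are linear combinations of order statistics. A minimal sketch of the first two (via the standard unbiased probability-weighted-moment estimators; this is a textbook formula, not code from lmoments3):

```python
import numpy as np

def sample_l_moments(x):
    """First two sample L-moments via probability-weighted moments.

    l1 (L-location) equals the sample mean; l2 (L-scale) is an
    outlier-robust analogue of the scale, built from order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    b0 = x.mean()
    # Unbiased estimate of b1 = E[X * F(X)] from the ordered sample
    b1 = np.sum(np.arange(n) * x) / (n * (n - 1))
    l1 = b0
    l2 = 2 * b1 - b0
    return l1, l2
```

Fitting by the method of L-moments then means equating these sample quantities to the distribution's theoretical L-moments and solving for the parameters, analogously to the ordinary method of moments.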

1 reaction
mdhaber commented, Jan 17, 2021

The table above lists the ones we had planned based on what we had found in this reference.

That said, as we went through, we found that because SciPy’s parameterization always includes scale and loc, we frequently had to derive the conditions ourselves; the reference mostly gave us hints about which distributions were worth trying. I wrote a little script that uses SymPy to help with that, if you’re interested. It derives the likelihood equations and solves them when it can. It’s not perfect: it often fails to find a solution even when one exists, it usually writes things in a form that’s inconvenient to work with, and it doesn’t tell you what to do when the solution to a likelihood equation isn’t the MLE (e.g. when there is no solution, when it is a minimum rather than a maximum, or when there are other constraints to consider). It is mostly useful for quickly seeing how complicated the likelihood equations are, so you can assess whether there might be a closed-form solution. It’s nice when you can get a closed-form solution for at least one parameter; if not, it’s better to fall back on the generic optimization rather than solve the likelihood equations numerically, since the likelihood equations don’t account for constraints (e.g. that the data must lie within the support of the distribution).

@swallan had you started on a PR for von Mises? If not, that would be a good one for @jjerphan. There are some notes about it above.

@lispsil said he was interested in Weibull. (gh-12787). exponweibull is another one I’ve seen an issue about (gh-11806).

Otherwise, any distribution is fair game if it hasn’t already been done, I think. Even ones that have been done might not be implemented for all combinations of fixed and free parameters.

I wrote this in gh-12787, but I should copy it here:

I think gh-12737 is a good example of how to override fit. It shows how different cases of user input (e.g. loc is fixed while the others are free) may need to be treated separately, including falling back on the generic super fit method when no formula is available. We have (over the course of a few PRs) come up with a nice _check_fit_input_parameters method that should help. It also shows how the ._fitstart method is used to get initial guesses of the parameters (you can override that too, if you need to) and how you can use _assert_less_or_close_loglike in your tests to ensure that the analytical fit is no worse than the superclass fit. Finally, you can test the performance of your new fit override by adding it to the ContinuousFitAnalyticalMLEOverride benchmark.
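The structural pattern described above (analytical formulas in the unconstrained case, falling back on the generic super fit otherwise) looks roughly like this. This is a hedged, self-contained sketch using the Laplace formulas, not SciPy's actual source, and it omits the real `_check_fit_input_parameters` helper:

```python
import numpy as np
from scipy import stats


class _LaplaceSketch(stats.rv_continuous):
    """Illustration of the fit-override pattern, not SciPy's implementation."""

    def _pdf(self, x):
        # Standard Laplace density
        return 0.5 * np.exp(-np.abs(x))

    def fit(self, data, *args, **kwds):
        # Fall back on the generic numerical fit when a parameter is fixed
        # or a non-MLE method is requested; no analytical formula applies then.
        if (kwds.get('method', 'mle').lower() != 'mle'
                or 'floc' in kwds or 'fscale' in kwds):
            return super().fit(data, *args, **kwds)
        data = np.asarray(data)
        loc = np.median(data)                 # analytical MLE for loc
        scale = np.mean(np.abs(data - loc))   # analytical MLE for scale
        return loc, scale


laplace_sketch = _LaplaceSketch(name='laplace_sketch')
```

A real override has more cases to handle (one parameter fixed, the other free; guesses from ._fitstart when falling back), but the dispatch shape is the same.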
