Unexpectedly poor results when distribution fitting with `weibull_min` and `exponweib`

My issue is about distribution fitting with weibull_min and exponweib returning clearly incorrect results for shape and scale parameters.

Full details here: https://stats.stackexchange.com/questions/458652/scipy-stats-failing-to-fit-weibull-distribution-unless-location-parameter-is-con

import numpy as np
import pandas as pd
from scipy import stats

x = [4836.6, 823.6, 3131.7, 1343.4, 709.7, 610.6, 
     3034.2, 1973, 7358.5, 265, 4590.5, 5440.4, 4613.7, 4763.1, 
     115.3, 5385.1, 6398.1, 8444.6, 2397.1, 3259.7, 307.5, 4607.4, 
     6523.7, 600.3, 2813.5, 6119.8, 6438.8, 2799.1, 2849.8, 5309.6, 
     3182.4, 705.5, 5673.3, 2939.9, 2631.8, 5002.1, 1967.3, 2810.4,
     2948, 6904.8]

stats.weibull_min.fit(x)

Here are the results:

shape, loc, scale = (0.1102610560437356, 115.29999999999998, 3.428664764594809)

This is clearly a very poor fit to the data. I am aware that by constraining the loc parameter to zero, I can get better results, but why should this be necessary? Shouldn’t the unconstrained fit be more likely to overfit the data than to dramatically under-fit?

And what if I want to estimate the location parameter without constraint - why should that return such unexpected results for the shape and scale parameters?
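
As a rough illustration of the comparison raised above, the sketch below refits the same data with the location fixed at zero (`floc=0`) and compares its negative log-likelihood with that of the unconstrained parameters reported earlier. `nnlf` is the negative log-likelihood method on SciPy distributions. The variable names are illustrative, and the exact fitted values will depend on the SciPy version.

```python
import numpy as np
from scipy import stats

x = np.array([4836.6, 823.6, 3131.7, 1343.4, 709.7, 610.6,
              3034.2, 1973, 7358.5, 265, 4590.5, 5440.4, 4613.7, 4763.1,
              115.3, 5385.1, 6398.1, 8444.6, 2397.1, 3259.7, 307.5, 4607.4,
              6523.7, 600.3, 2813.5, 6119.8, 6438.8, 2799.1, 2849.8, 5309.6,
              3182.4, 705.5, 5673.3, 2939.9, 2631.8, 5002.1, 1967.3, 2810.4,
              2948, 6904.8])

# Parameters returned by the unconstrained fit reported in this issue.
reported = (0.1102610560437356, 115.29999999999998, 3.428664764594809)

# Refit with the location parameter fixed at zero.
c0, loc0, scale0 = stats.weibull_min.fit(x, floc=0)

# Negative log-likelihood: lower means the parameters explain the data better.
nll_reported = stats.weibull_min.nnlf(reported, x)
nll_fixed = stats.weibull_min.nnlf((c0, loc0, scale0), x)

print(f"floc=0 fit: shape={c0:.3f}, scale={scale0:.1f}")
print(f"nnlf (reported)={nll_reported:.1f}, nnlf (floc=0)={nll_fixed:.1f}")
```

If the constrained fit has the lower negative log-likelihood, the unconstrained result cannot be the maximum-likelihood solution, which would point at the optimizer's search rather than at the model itself.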

Scipy/Numpy/Python version information:

import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info)

1.4.1 1.18.1 sys.version_info(major=3, minor=6, micro=10, releaselevel='final', serial=0)

Issue Analytics

State:
Created 3 years ago
Comments: 50 (32 by maintainers)

Top GitHub Comments

3 reactions

pbrod commented, Apr 20, 2020

A good start value is crucial for the ML method to find a good solution, especially if many parameters are to be estimated. So if weibull_min had a better _fitstart function like this:

>>> import numpy as np

>>> def _fitstart(data):
...      """Reference: Cohen & Whitten (1988), "Parameter Estimation in Reliability
...           and Life Span Models", p. 25 ff, Marcel Dekker
...      """
...      loc = data.min() - 0.01  # alternatively: data.min() - 0.01 * np.std(data)
...      chat = 1. / (6 ** (1 / 2) / np.pi * np.std(np.log(data - loc)))
...      scale = np.mean((data - loc) ** chat) ** (1. / chat)
...      return chat, loc, scale

it would resolve many cases of unexpectedly poor results when fitting the weibull_min distribution, so the user wouldn’t have to fiddle with fine-tuning the fit. Below I compare the start values from the default stats.weibull_min._fitstart method with those from the method above. It is clear that the bad start values for the parameters are the cause of this poor fit:


>>> from scipy import stats

>>> x = [4836.6, 823.6, 3131.7, 1343.4, 709.7, 610.6, 
...     3034.2, 1973, 7358.5, 265, 4590.5, 5440.4, 4613.7, 4763.1, 
...     115.3, 5385.1, 6398.1, 8444.6, 2397.1, 3259.7, 307.5, 4607.4, 
...     6523.7, 600.3, 2813.5, 6119.8, 6438.8, 2799.1, 2849.8, 5309.6, 
...     3182.4, 705.5, 5673.3, 2939.9, 2631.8, 5002.1, 1967.3, 2810.4,
...     2948, 6904.8]

>>> data = np.asarray(x)
>>> _fitstart(data)
(0.5899119992937546, 115.28999999999999, 3060.830920639086)
>>> stats.weibull_min._fitstart(data)
(1.0, -717.6300000000002, 1)

>>> chat, loc, scale = _fitstart(data)
>>> stats.weibull_min.fit(x, chat, loc=loc, scale=scale)
(1.7760067130728838, -322.09255199577206, 4355.262678526975)
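
The start-value formulas above can be read as follows: for a Weibull variable with shape c, the standard deviation of log(X - loc) is π/(c√6), which gives the shape guess, and the scale guess is the value that maximizes the likelihood once shape and location are held fixed. The sketch below checks both formulas on synthetic data. The true parameters, the sample size, and the simplification that the location is known to be zero are assumptions made only for this illustration.

```python
import numpy as np
from scipy import stats

# Synthetic sample with known parameters (illustrative values, loc = 0).
rng = np.random.default_rng(0)
true_c, true_scale = 1.8, 4000.0
sample = stats.weibull_min.rvs(true_c, loc=0, scale=true_scale,
                               size=20000, random_state=rng)

# Shape guess from the spread of the log-data: std(log X) = pi / (c * sqrt(6)).
c_hat = np.pi / (np.sqrt(6) * np.std(np.log(sample)))

# Scale guess: the likelihood-maximizing scale for the given shape and location.
scale_hat = np.mean(sample ** c_hat) ** (1.0 / c_hat)

print(f"shape guess={c_hat:.3f} (true {true_c})")
print(f"scale guess={scale_hat:.1f} (true {true_scale})")
```

With a large sample both guesses should land close to the true values, which is why they make reasonable starting points for the optimizer.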

2 reactions

mdhaber commented, Feb 23, 2021

Kidding aside, if “loc = x” had a little more influence that would be a good thing.

You mean if providing the guess loc=x was a stronger hint, it would be preferable? I see. Even if you provide loc=0, SciPy finds the crazy solution. But perhaps that is because there is not a local minimum of the objective function for it to settle into that is near loc=0.
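
One way to look at how the likelihood behaves as the location changes is to fix `loc` at several values, fit the remaining two parameters each time, and record the negative log-likelihood. The sketch below does this for the data from the issue description. The grid of locations is an arbitrary choice for illustration, and the results depend on how well each inner fit converges in the SciPy version at hand.

```python
import numpy as np
from scipy import stats

x = np.array([4836.6, 823.6, 3131.7, 1343.4, 709.7, 610.6,
              3034.2, 1973, 7358.5, 265, 4590.5, 5440.4, 4613.7, 4763.1,
              115.3, 5385.1, 6398.1, 8444.6, 2397.1, 3259.7, 307.5, 4607.4,
              6523.7, 600.3, 2813.5, 6119.8, 6438.8, 2799.1, 2849.8, 5309.6,
              3182.4, 705.5, 5673.3, 2939.9, 2631.8, 5002.1, 1967.3, 2810.4,
              2948, 6904.8])

# Candidate locations, all strictly below min(x) so every point is in the support.
loc_grid = np.linspace(-2000.0, 100.0, 8)

profile = []
for loc in loc_grid:
    # Fit shape and scale only; the location is held fixed via floc.
    c, _, scale = stats.weibull_min.fit(x, floc=loc)
    nll = stats.weibull_min.nnlf((c, loc, scale), x)
    profile.append((loc, c, scale, nll))

for loc, c, scale, nll in profile:
    print(f"loc={loc:8.1f}  shape={c:.3f}  scale={scale:8.1f}  nnlf={nll:.2f}")
```

A flat stretch in the `nnlf` column would be consistent with the remark above that the optimizer has no well-defined local minimum to settle into near a given location.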

I’m not sure if we can bake what you’re looking for into the fit method. Maximum Likelihood Estimation is a specific way of fitting a distribution to data, and I think SciPy is doing a reasonable job of that here. You may want to define your own objective function that includes a mathematical description of sanity : )

I’m only partially joking. It’s really not too tough to fit using minimize directly. Assuming you’ve already run the code above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from scipy.stats import weibull_min

# x, q, pdf, pdf2 and the fitted parameters (c, u, s) and (c2, u2, s2) come
# from the earlier code in the original thread, which is not reproduced here.

def ll(params, data):
    # negative of log likelihood function as we're minimizing
    return -np.sum(np.log(weibull_min.pdf(data, *params)))

res = minimize(ll, (6.4, 0, 21.6), args=(x,))
c3, u3, s3 = res.x
pdf3 = weibull_min.pdf(q, c3, loc=u3, scale=s3)

plt.plot(q, pdf, '-', q, pdf2, '--', q, pdf3, '-')
plt.hist(x, density=True, alpha=0.5)
plt.title("weibull_min fits")
plt.legend((f"pdf(c={c:.2}, loc={u:.2}, scale={s:.2})",
            f"pdf(c={c2:.3}, loc={u2}, scale={s2:.3})",
            f"pdf(c={c3:.3}, loc={u3:.2}, scale={s3:.3})",
            "observations"))

produces a plot of the three fitted PDFs over a histogram of the observations (figure not reproduced here).

This is not much better, but the point is that now you can change the objective function or add constraints to get something closer to what you’re looking for. (It’s just not maximum likelihood estimation anymore.)
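
A minimal sketch of the "add constraints" idea, applied to the data from the issue description: the negative log-likelihood is minimized with `scipy.optimize.minimize` under box bounds. The bounds, the starting point, and the choice of L-BFGS-B are assumptions made for illustration and do not come from the thread. Keeping the shape at or above 1 rules out the region where the Weibull density is unbounded near `loc`.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

x = np.array([4836.6, 823.6, 3131.7, 1343.4, 709.7, 610.6,
              3034.2, 1973, 7358.5, 265, 4590.5, 5440.4, 4613.7, 4763.1,
              115.3, 5385.1, 6398.1, 8444.6, 2397.1, 3259.7, 307.5, 4607.4,
              6523.7, 600.3, 2813.5, 6119.8, 6438.8, 2799.1, 2849.8, 5309.6,
              3182.4, 705.5, 5673.3, 2939.9, 2631.8, 5002.1, 1967.3, 2810.4,
              2948, 6904.8])

def neg_log_likelihood(params, data):
    # Negative log-likelihood of a three-parameter Weibull (shape, loc, scale).
    c, loc, scale = params
    return -np.sum(weibull_min.logpdf(data, c, loc=loc, scale=scale))

# Illustrative bounds (assumptions): shape in [1, 10], loc strictly below
# the smallest observation, scale at least 1.
bounds = [(1.0, 10.0), (None, x.min() - 1e-3), (1.0, None)]
start = np.array([1.5, 0.0, x.mean()])

res = minimize(neg_log_likelihood, start, args=(x,),
               method="L-BFGS-B", bounds=bounds)
c_hat, loc_hat, scale_hat = res.x

print(f"converged={res.success}")
print(f"shape={c_hat:.3f}, loc={loc_hat:.1f}, scale={scale_hat:.1f}, nll={res.fun:.2f}")
```

Because the bounds restrict the search, the result is a constrained estimate rather than the plain maximum-likelihood fit, in line with the caveat above.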
