BUG: power_divergence is raising on master when it does not in 1.6 or earlier

My issue is about a bug appearing in the unreleased SciPy 1.7.0.dev0+9b9f2e8. This is appearing in the statsmodels pip-pre run.

Reproducing code example:

from numpy import array
from scipy import stats


f_obs = array([44, 74, 48, 24, 8, 2])
f_exp = array([43.93623589, 71.60431015, 52.20297275, 23.22015199, 7.16316534,
               1.59334208])
chi2 = stats.chisquare(f_obs, f_exp)

Error message:

>               raise ValueError(msg)
E  ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected 
E frequencies to a relative tolerance of 1e-08, but the percent differences are:
E  0.0014010692380163618
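
The figure in the message can be reproduced by hand. The sketch below assumes the check compares the absolute difference of the two sums against the smaller sum; that formula is an inference from the reported value, not a quote of SciPy's source. The rescaling at the end is only one possible workaround, and whether it is statistically appropriate depends on how `f_exp` was derived.

```python
import numpy as np

f_obs = np.array([44, 74, 48, 24, 8, 2])
f_exp = np.array([43.93623589, 71.60431015, 52.20297275,
                  23.22015199, 7.16316534, 1.59334208])

sum_obs = f_obs.sum()
sum_exp = f_exp.sum()

# Assumed form of the check: relative difference against the smaller sum.
rel_diff = abs(sum_obs - sum_exp) / min(sum_obs, sum_exp)
print(sum_obs, sum_exp, rel_diff)

# One possible workaround: rescale the expected counts so the sums agree.
f_exp_rescaled = f_exp * sum_obs / sum_exp
print(f_exp_rescaled.sum())
```

The expected counts sum to slightly less than the 200 observations, so the mismatch is far above the 1e-08 tolerance.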

Scipy/Numpy/Python version information:

Issue Analytics

Created 3 years ago
Comments: 34 (20 by maintainers)

Top GitHub Comments

rkern commented, Nov 8, 2022

Your implementation is just incorrect. Even in the book that you cite, it deals with the truncation. I also think you are missing a (1-exp) term. See the section on the Gap test in Knuth’s Seminumerical Algorithms, probably the most authoritative source, for the correct formulae.

josef-pkt commented, Nov 8, 2022

A guess based on a quick code check:

The skidmarks.gap_test function truncates the array of expected observations. So, AFAICS, the probabilities behind egaps will not add up to one:

    egaps = [l * (exp ** ii) for ii in range(1, len(ogaps) + 1)]
    chi, pval = chisquare(np.array(ogaps), np.array(egaps))

In statsmodels I had a similar problem. I truncated the expected counts, e.g. for Poisson; the truncation was negligible before the change in scipy, but after it the test raised in some cases. My fix, AFAIR, was to add the missing probability to the last count, which also makes sense for the chisquare test even if the truncated amount is non-negligible.
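
A minimal sketch of that fix, assuming a Poisson model with a made-up mean `mu` and a hypothetical helper `expected_counts_with_tail` (neither comes from statsmodels or skidmarks). The truncated tail probability is folded into the last cell, so the last cell then stands for "this value or larger".

```python
import numpy as np
from scipy import stats

def expected_counts_with_tail(probs, n_obs):
    """Fold the truncated tail probability into the last cell, then scale to counts.

    Hypothetical helper for illustration only.
    """
    probs = np.asarray(probs, dtype=float).copy()
    probs[-1] += 1.0 - probs.sum()  # missing probability goes to the last cell
    return n_obs * probs

mu = 1.6  # illustrative value, not estimated from the issue's data
k = np.arange(6)
probs = stats.poisson.pmf(k, mu)  # truncated support: sums to less than 1

f_obs = np.array([44, 74, 48, 24, 8, 2])
f_exp = expected_counts_with_tail(probs, f_obs.sum())

# The sums now agree, so the tolerance check should pass.
chi2, pval = stats.chisquare(f_obs, f_exp)
print(f_exp.sum(), chi2, pval)
```

If `mu` were estimated from the same data, the `ddof` argument of `chisquare` would normally be adjusted as well; the sketch leaves that out.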

Also, the chisquare test requires that the expected count in each cell is large enough, so relying on tiny cells is not good for p-values either.
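
One common way to deal with tiny cells is to pool them before testing. The sketch below uses a hypothetical helper `pool_small_tail` and the usual rule-of-thumb threshold of 5 expected counts per cell; both are assumptions for illustration, not something proposed in the thread.

```python
import numpy as np

def pool_small_tail(f_obs, f_exp, min_expected=5.0):
    """Merge trailing cells until the last expected count reaches min_expected.

    Hypothetical helper; the threshold of 5 is a rule of thumb, not a hard requirement.
    """
    obs = [float(x) for x in f_obs]
    exp = [float(x) for x in f_exp]
    while len(exp) > 1 and exp[-1] < min_expected:
        last_exp = exp.pop()
        last_obs = obs.pop()
        exp[-1] += last_exp  # fold the small cell into its neighbour
        obs[-1] += last_obs
    return np.array(obs), np.array(exp)

f_obs = np.array([44, 74, 48, 24, 8, 2])
f_exp = np.array([43.93623589, 71.60431015, 52.20297275,
                  23.22015199, 7.16316534, 1.59334208])

pooled_obs, pooled_exp = pool_small_tail(f_obs, f_exp)
print(pooled_obs, pooled_exp)
```

Pooling removes the tiny cell but does not change either total, so the sum mismatch from the original report still has to be fixed separately.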