question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

`_cdf_cvm` inaccurate for small `n`

See original GitHub issue

The documentation of _cdf_cvm notes:

The function is not expected to be accurate for large values of x, say x > 2, when the cdf is very close to 1 and it might return values > 1 in that case, e.g. _cdf_cvm(2.0, 12) = 1.0000027556716846.

However, in addition to that, the function is inaccurate for small values of n, in particular 2. Below approx. 0.042 the value drops to 0, and above approx. 0.666 it jumps to 1. To a smaller extent, similar things happen for other small n.

I don’t know whether this can be fixed, but it should be documented.

Reproducing code example:

>>> _cdf_cvm([0.041, 0.042, 0.666, 0.667], n=2)
array([0.      , 0.031081, 0.993724, 1.      ])

Scipy/Numpy/Python version information:

1.6.3 1.20.2 sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)

Plots

bokeh_plot bokeh_plot

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Comments:10 (10 by maintainers)

github_iconTop GitHub Comments

1reaction
chrisb83commented, Jul 15, 2021

Hi,

if you check the second reference in the documenation, you will find some more details on the approximation:

Csorgo, S. and Faraway, J. (1996). The Exact and Asymptotic Distribution of Cramér-von Mises Statistics. Journal of the Royal Statistical Society, pp. 221-234. => you can find it on the website https://people.bath.ac.uk/jjf23/papers/index.html

Below approx. 0.042 the value drops to 0, and above approx. 0.666 it jumps to 1

That is correct, the support of the exact distribution is contained in [1/(12*n), n/3], see Eq 1.1 in the article

You can take a look at Table 1 in the article, it contains a comparison of the exact values and the approximation. And indeed, for n=2, the approximation is not great but that is not a problem with the implementation but rather the way the approximation works (error is about 1/n**2, see Eq 1.8)

So if you need more accurate values for small n, there are other approaches you could use, they are briefly mentioned in the article, too. I hope it helps…

0reactions
allefeldcommented, Jul 21, 2021

I’ve created a PR, please check.

@chrisb83:

I don’t understand the problems of the approximation well enough to say whether it would be feasible to make a more precise implementation. I suspect that would entail numerical calculations that are too expensive for the function itself, so one would probably have to pre-calculate a tabulation and then interpolate.

I agree that there is no acute problem for the GoF test, because it is unlikely that anyone would do such a test on a sample of 2 or similar, and even then at the conventional level of 0.05 one would have to Bonferroni-correct for several tests in order to reach the values where the approximation fails relevantly.

I would still appreciate if the function (either the current approximation or something better) could be exposed publicly. scipy.special seems like a good place for that.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Problems with small sample sizes | basic statistics
In column 4, samples with n = 20 and n = 5 are very misleading. figure1. Small sample size could be less of...
Read more >
Best Practices for Using Statistics on Small Sample Sizes
Some people think that if you have a small sample size you can't use statistics. Put simply, this is wrong, but it's a...
Read more >
Practical Semiparametric Inference With Bayesian ...
the predictive accuracy of each candidate model hk varies across space and time, and there may exist systematic prediction bias among all hk's....
Read more >
Too Much or Not Enough: Sample Sizes and Statistical Analysis
With small sample sizes, the increase in false negative rates can give a sense of complacency and lead to no process improvement.
Read more >
False-positive findings and how to minimize them
In his recent paper Minimizing the changes of false positives and false negatives, Curran-Everett demonstrates that the chance of obtaining ...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found