Automatically Estimated Minimum Number of Bootstrap Samples
Is your feature request related to a problem? Please describe.
The feature request would add a valuable capability to the scipy.stats.bootstrap algorithm (“https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html”). Specifically, it would allow a reduction of up to two orders of magnitude in the number of bootstrap samples (e.g., from the default of 9,999 down to about 100). Equally important, it would also detect when a larger-than-default number of bootstrap samples is needed because the data sample requires it.
Describe the solution you’d like
The solution would implement an algorithm for choosing the optimal number of bootstrap samples, following the approach presented in “http://dido.econ.yale.edu/~dwka/pub/p1001.pdf”, published in 2000. The full citation appears below:
@article{Andrews2000a,
added-at = {2008-04-25T10:38:44.000+0200},
author = {Andrews, Donald W. K. and Buchinsky, Moshe},
biburl = {https://www.bibsonomy.org/bibtex/28e2f0a58cdb95e39659921f989a17bdd/smicha},
day = 01,
interhash = {778746398daa9ba63bdd95391f1efd37},
intrahash = {8e2f0a58cdb95e39659921f989a17bdd},
journal = {Econometrica},
keywords = {imported},
month = Jan,
note = {doi: 10.1111/1468-0262.00092},
number = 1,
pages = {23--51},
timestamp = {2008-04-25T10:38:52.000+0200},
title = {A Three-step Method for Choosing the Number of Bootstrap Repetitions},
url = {http://www.blackwell-synergy.com/doi/abs/10.1111/1468-0262.00092},
volume = 68,
year = 2000
}
This algorithm works as an iterative loop:
a) estimate the number of bootstrap samples;
b) generate these samples;
c) compute statistics on the resulting ndarray, and use them to update the estimate of the number of bootstrap samples from (a);
d) repeat these steps until convergence.
Because the optimal number of bootstrap samples is taken as the maximum of the estimates up to the current iteration, the procedure converges very rapidly (often in just 3 – 5 iterations).
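The loop described above can be sketched roughly as follows for the standard error case. This is a minimal illustration, not the linked implementation: the function name `estimate_n_resamples`, the parameters `pdb` (tolerated percentage deviation) and `tau` (probability of exceeding it), and the reuse of earlier resamples between iterations are all assumptions made here, and the sample-size formula is my reading of the standard error case in the paper, so it should be checked against the original before being relied on.

```python
import numpy as np
from scipy import stats


def estimate_n_resamples(data, statistic, pdb=10.0, tau=0.05, max_iter=10, rng=None):
    """Hypothetical helper: iterate the resample count for a bootstrap standard error.

    Assumed formula (standard error case, as I read the paper):
        B = 10000 * z_{1 - tau/2}**2 * (2 + gamma) / (4 * pdb**2)
    where gamma is the excess kurtosis of the bootstrap estimates.
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(data)  # assumed one-dimensional
    z = stats.norm.ppf(1 - tau / 2)
    scale = 10_000 * z**2 / (4 * pdb**2)

    # (a) initial estimate, taking the excess kurtosis to be zero
    n = int(np.ceil(scale * 2))
    boot = np.empty(0)
    for _ in range(max_iter):
        # (b) draw only the resamples that are still missing
        extra = n - boot.size
        if extra > 0:
            idx = rng.integers(0, data.size, size=(extra, data.size))
            new = np.array([statistic(data[i]) for i in idx])
            boot = np.concatenate([boot, new])
        # (c) update the estimate from the excess kurtosis of the bootstrap estimates
        gamma = stats.kurtosis(boot, fisher=True, bias=False)
        if not np.isfinite(gamma):
            gamma = 0.0  # degenerate case, e.g. a constant statistic
        updated = max(n, int(np.ceil(scale * (2 + gamma))))
        # (d) stop once the running maximum no longer grows
        if updated == n:
            break
        n = updated
    return n, boot.std(ddof=1)


sample = np.random.default_rng(0).normal(size=200)
n_needed, se = estimate_n_resamples(sample, np.mean, rng=1)
print(n_needed, se)
```

Because the estimate is a running maximum, each pass can only add resamples, which is what keeps the number of iterations small.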
The initial contribution would focus on optimizing the standard error estimate of the bootstrap samples. The next effort would be devoted to adding support for confidence intervals. After that, applicability to other statistics can be considered.
Describe alternatives you’ve considered
- Using the default number of bootstrap samples (which is 9,999 according to “https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html”).
- Guessing the number of bootstrap samples based on the size and a preliminary analysis of the data set.
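For reference, the first alternative corresponds to calling scipy.stats.bootstrap with a fixed `n_resamples`. The snippet below is only an illustration on synthetic data (the sample and the choice of `method="percentile"` are assumptions made here); it compares the reported standard error at the default count and at a hand-picked smaller one.

```python
import numpy as np
from scipy import stats

# Synthetic data, used purely for illustration
data = np.random.default_rng(0).normal(size=200)

# Default number of resamples versus a manually guessed, much smaller number
full = stats.bootstrap((data,), np.mean, n_resamples=9999, method="percentile")
small = stats.bootstrap((data,), np.mean, n_resamples=100, method="percentile")

print(full.standard_error, small.standard_error)
```

With a fixed count, nothing indicates whether 100 resamples were enough for this data set, which is the gap the proposed feature is meant to close.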
Additional context (e.g. screenshots)
There is an initial implementation (please see https://github.com/great-expectations/great_expectations/tree/feature/GE-160/GE-237/alexsherstinsky/optimal_num_samples_bootstrapped_range_parameter_builder-2021_06_23-37/great_expectations/rule_based_profiler/estimators), which is focused on optimizing the standard error estimate of the bootstrap samples. The first task would be to determine whether it can be adapted to the programming style of “https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html” and, if so, to issue a pull request.
Issue Analytics
- Created 2 years ago
- Reactions: 1
- Comments: 5 (3 by maintainers)
One question is how the user will specify which output they are interested in (standard error, CI, or some new output like bias), as the number of bootstrap samples will depend on that. Originally, I suggested n_resamples=None in the sense of “no fixed number”, but that would not distinguish between SE and CI. We could allow a string like n_resamples='auto-ci', and I would probably prefer that over a separate argument, but I’m not sure what’s best.
I’ll suggest a way to tie this feature into the existing code shortly.
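One way the string option could be interpreted is sketched below. Everything here is hypothetical: scipy.stats.bootstrap does not currently accept strings for n_resamples, the helper name `parse_n_resamples` is made up, and 'auto-se' is an assumed counterpart to the 'auto-ci' spelling mentioned above.

```python
import operator

# Hypothetical mapping from string options to the quantity being targeted
_AUTO_TARGETS = {
    "auto-se": "standard_error",
    "auto-ci": "confidence_interval",
}


def parse_n_resamples(n_resamples):
    """Return (count, target); exactly one of the two is None."""
    if isinstance(n_resamples, str):
        try:
            return None, _AUTO_TARGETS[n_resamples]
        except KeyError:
            raise ValueError(
                f"n_resamples must be a positive integer or one of {sorted(_AUTO_TARGETS)}"
            ) from None
    count = operator.index(n_resamples)  # raises TypeError for floats and None
    if count <= 0:
        raise ValueError("n_resamples must be a positive integer")
    return count, None


print(parse_n_resamples(9999))       # fixed number of resamples
print(parse_n_resamples("auto-ci"))  # number to be estimated for the CI
```

Keeping the choice in a single argument means a fixed count and an automatic mode cannot be requested at the same time.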
@tupui Thank you for getting back to me – I would love to do a PR! I would also like to confirm with @mdhaber that I can proceed according to the plan I outlined in the feature request (since he has been helping me tremendously), or if I should alter the course of action in any way. Thank you very much.