Automatically Estimated Minimum Number of Bootstrap Samples

Is your feature request related to a problem? Please describe.

The feature request would add a valuable capability to the “https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html” algorithm. Specifically, it would allow the number of bootstrap samples to be reduced by up to two orders of magnitude (e.g., from the default of 9,999 down to roughly 100). Equally important, it would also detect when a larger-than-default number of bootstrap samples is needed, if the data sample requires it.

Describe the solution you’d like

The solution would implement an algorithm for choosing the number of bootstrap samples, following the approach presented in “http://dido.econ.yale.edu/~dwka/pub/p1001.pdf”, published in 2000. The full citation appears below:

@article{Andrews2000a,
  added-at = {2008-04-25T10:38:44.000+0200},
  author = {Andrews, Donald W. K. and Buchinsky, Moshe},
  biburl = {https://www.bibsonomy.org/bibtex/28e2f0a58cdb95e39659921f989a17bdd/smicha},
  day = 01,
  interhash = {778746398daa9ba63bdd95391f1efd37},
  intrahash = {8e2f0a58cdb95e39659921f989a17bdd},
  journal = {Econometrica},
  keywords = {imported},
  month = Jan,
  note = {doi: 10.1111/1468-0262.00092},
  number = 1,
  pages = {23--51},
  timestamp = {2008-04-25T10:38:52.000+0200},
  title = {A Three-step Method for Choosing the Number of Bootstrap Repetitions},
  url = {http://www.blackwell-synergy.com/doi/abs/10.1111/1468-0262.00092},
  volume = 68,
  year = 2000
}

The algorithm is an iterative loop: (a) estimate the required number of bootstrap samples; (b) generate that many samples; (c) compute statistics on the resulting ndarray and use them to update the estimate from (a); (d) repeat these steps until the estimate converges.

Because the optimal number of bootstrap samples is taken as the maximum of the estimates up to the current iteration, the procedure converges very rapidly (often in just 3 – 5 iterations).
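
For illustration, the loop described above might be sketched as follows for the standard-error case. This is a minimal sketch, not the linked implementation and not part of SciPy. The function names, the `pdb` (percentage deviation bound) and `tau` parameters, and the update rule B = 10000 * z^2 * (2 + kurtosis) / (4 * pdb^2), where kurtosis is the excess kurtosis of the bootstrap replicates, are assumptions based on a reading of the cited paper and should be checked against it.

```python
import math
from statistics import NormalDist

import numpy as np


def excess_kurtosis(values):
    """Sample excess kurtosis (0 for a normal distribution)."""
    centered = values - values.mean()
    variance = np.mean(centered**2)
    if variance == 0.0:
        return 0.0
    return float(np.mean(centered**4) / variance**2 - 3.0)


def required_resamples(kurt, pdb=10.0, tau=0.05):
    """Assumed update rule: B = 10000 * z^2 * (2 + kurt) / (4 * pdb^2)."""
    z = NormalDist().inv_cdf(1.0 - tau / 2.0)
    return max(2, math.ceil(10000.0 * z**2 * (2.0 + kurt) / (4.0 * pdb**2)))


def adaptive_bootstrap_se(data, statistic, pdb=10.0, tau=0.05, max_iter=10, rng=None):
    """Bootstrap standard error of `statistic` on 1-D `data`, growing the
    number of resamples until the estimate of that number stops increasing."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    # (a) initial estimate, assuming zero excess kurtosis
    n_resamples = required_resamples(0.0, pdb, tau)
    replicates = np.empty(0)
    for _ in range(max_iter):
        # (b) draw only the resamples that are still missing
        n_missing = n_resamples - replicates.size
        idx = rng.integers(0, data.size, size=(n_missing, data.size))
        new = np.array([statistic(data[row]) for row in idx])
        replicates = np.concatenate([replicates, new])
        # (c) update the estimate from the replicates, keeping the running maximum
        updated = max(
            n_resamples,
            required_resamples(excess_kurtosis(replicates), pdb, tau),
        )
        # (d) stop once the estimate no longer grows
        if updated == n_resamples:
            break
        n_resamples = updated
    return float(replicates.std(ddof=1)), int(replicates.size)


sample = np.random.default_rng(0).normal(size=50)
se, n_used = adaptive_bootstrap_se(sample, np.mean, rng=1)
print(f"standard error ~ {se:.4f} from {n_used} resamples")
```

Under these assumptions, the initial estimate for `pdb=10` and `tau=0.05` is 193 resamples. Reusing the replicates already drawn, rather than regenerating them on every pass, is a design choice of this sketch and keeps each iteration cheap.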

The initial contribution would focus on optimizing the standard error estimate of the bootstrap samples. The next effort would add support for confidence intervals. After that, applicability to other statistics can be considered.

Describe alternatives you’ve considered

Using the default number of bootstrap samples (which is 9,999 according to “https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html”).
Guessing the number of bootstrap samples based on the size of the data set and a preliminary analysis of it.

Additional context (e.g. screenshots)

There is an initial implementation (please see https://github.com/great-expectations/great_expectations/tree/feature/GE-160/GE-237/alexsherstinsky/optimal_num_samples_bootstrapped_range_parameter_builder-2021_06_23-37/great_expectations/rule_based_profiler/estimators), which focuses on optimizing the standard error estimate of the bootstrap samples. The first task would be to determine whether it can be adapted to the programming style of “https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html” and, if so, to open a pull request.

Top GitHub Comments

mdhaber commented, Jun 29, 2021

One question is how the user will specify which quantity they are interested in (standard error, CI, or some new output like bias), since the number of bootstrap samples will depend on that. Originally, I suggested n_resamples=None in the sense of “no fixed number”, but that would not distinguish between SE and CI. We could allow a string like n_resamples='auto-ci', and I would probably prefer that over a separate argument, but I’m not sure what’s best.

I’ll suggest a way to tie this feature into the existing code shortly.
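
To make the string option concrete, here is a hedged sketch of how such an argument might be validated. Nothing here is SciPy API: `parse_n_resamples` and `AUTO_TARGETS` are hypothetical names, and only 'auto-ci' appears in the comment above; 'auto-se' is an assumed counterpart for the standard-error case.

```python
from numbers import Integral

# Hypothetical mode names: 'auto-ci' comes from the discussion,
# 'auto-se' is an assumed counterpart for the standard error.
AUTO_TARGETS = {
    "auto-se": "standard_error",
    "auto-ci": "confidence_interval",
}


def parse_n_resamples(n_resamples):
    """Return (fixed_count, auto_target); exactly one of the two is not None."""
    if isinstance(n_resamples, str):
        if n_resamples not in AUTO_TARGETS:
            raise ValueError(
                f"unknown mode {n_resamples!r}; expected one of {sorted(AUTO_TARGETS)}"
            )
        return None, AUTO_TARGETS[n_resamples]
    # bool is a subclass of int, so reject it explicitly
    if (
        isinstance(n_resamples, Integral)
        and not isinstance(n_resamples, bool)
        and n_resamples > 0
    ):
        return int(n_resamples), None
    raise ValueError("`n_resamples` must be a positive integer or an 'auto-*' string")


print(parse_n_resamples(9999))
print(parse_n_resamples("auto-ci"))
```

A single argument that is either an integer or a mode string keeps the fixed and automatic cases mutually exclusive, which a separate argument would have to enforce by additional validation.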

alexsherstinsky commented, Jun 28, 2021

Quoting @tupui: “Thanks for the proposal @alexsherstinsky! This sounds like a good idea that looks simple to implement. Do you propose to do a PR?”

@tupui Thank you for getting back to me – I would love to do a PR! I would also like to confirm with @mdhaber that I can proceed according to the plan I outlined in the feature request (since he has been helping me tremendously), or if I should alter the course of action in any way. Thank you very much.