Automatically Estimated Minimum Number of Bootstrap Samples

Is your feature request related to a problem? Please describe.

This feature request would add a valuable capability to scipy.stats.bootstrap (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html). Specifically, it would allow a reduction of up to two orders of magnitude in the number of bootstrap samples (e.g., from the default of 10,000 down to 100). Equally important, it would also detect when more than the default number of bootstrap samples is needed, if the data sample requires it.

Describe the solution you’d like

The solution would implement an algorithm for choosing the optimal number of bootstrap samples, following the approach presented in http://dido.econ.yale.edu/~dwka/pub/p1001.pdf, published in 2000. The full citation appears below:

@article{Andrews2000a,
  added-at = {2008-04-25T10:38:44.000+0200},
  author = {Andrews, Donald W. K. and Buchinsky, Moshe},
  biburl = {https://www.bibsonomy.org/bibtex/28e2f0a58cdb95e39659921f989a17bdd/smicha},
  day = 01,
  interhash = {778746398daa9ba63bdd95391f1efd37},
  intrahash = {8e2f0a58cdb95e39659921f989a17bdd},
  journal = {Econometrica},
  keywords = {imported},
  month = Jan,
  note = {doi: 10.1111/1468-0262.00092},
  number = 1,
  pages = {23--51},
  timestamp = {2008-04-25T10:38:52.000+0200},
  title = {A Three-step Method for Choosing the Number of Bootstrap Repetitions},
  url = {http://www.blackwell-synergy.com/doi/abs/10.1111/1468-0262.00092},
  volume = 68,
  year = 2000
}

This algorithm works as an iterative loop: a) estimate the required number of bootstrap samples; b) generate that many samples; c) compute statistics on the resulting ndarray of samples, and use them to update the estimate from (a); d) repeat these steps until convergence.

Because the optimal number of bootstrap samples is taken as the maximum of the estimates up to the current iteration, the procedure converges very rapidly (often in just 3 – 5 iterations).
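The loop described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the linked implementation and not part of scipy.stats.bootstrap. The function names, the parameters `pdb` (percentage deviation bound) and `tau`, and the sample-size rule B = 10000 * z^2 * (2 + excess kurtosis) / (4 * pdb^2) are assumptions modeled on the standard-error case of the cited paper, so check them against the paper before relying on them.

```python
import math
import random
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis (about 0 for normally distributed values)."""
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m4 = sum((v - mean) ** 4 for v in values) / len(values)
    if m2 == 0.0:  # all replicates identical; no kurtosis information
        return 0.0
    return m4 / m2**2 - 3.0


def required_resamples(gamma2, pdb, tau):
    """Assumed rule: B = 10000 * z^2 * (2 + gamma2) / (4 * pdb^2)."""
    z = statistics.NormalDist().inv_cdf(1.0 - tau / 2.0)
    return max(2, math.ceil(10_000 * z**2 * (2.0 + gamma2) / (4.0 * pdb**2)))


def auto_bootstrap_se(data, statistic, pdb=10.0, tau=0.05, max_iter=10, seed=0):
    """Return (bootstrap standard error, number of resamples actually drawn)."""
    rng = random.Random(seed)
    n = len(data)
    # (a) initial estimate, assuming zero excess kurtosis
    n_resamples = required_resamples(0.0, pdb, tau)
    replicates = []
    for _ in range(max_iter):
        # (b) top up the pool of bootstrap replicates to the current estimate
        while len(replicates) < n_resamples:
            replicates.append(statistic(rng.choices(data, k=n)))
        # (c) re-estimate the requirement from the replicates drawn so far
        updated = required_resamples(excess_kurtosis(replicates), pdb, tau)
        # (d) keep the running maximum; stop once the estimate no longer grows
        if updated <= n_resamples:
            break
        n_resamples = updated
    return statistics.stdev(replicates), len(replicates)


if __name__ == "__main__":
    source = random.Random(1)
    sample = [source.gauss(0.0, 1.0) for _ in range(50)]
    se, used = auto_bootstrap_se(sample, statistics.fmean)
    print(f"standard error ~ {se:.3f} using {used} resamples")
```

Because the estimate only ever grows, earlier replicates are reused and each pass draws only the shortfall. A looser `pdb` or a larger `tau` lowers the required count.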

The initial contribution would focus on optimizing the standard error estimate of the bootstrap samples. The next effort would be devoted to adding support for confidence intervals. After that, applicability to other statistics can be considered.

Describe alternatives you’ve considered

Additional context (e.g. screenshots)

There is an initial implementation (please see https://github.com/great-expectations/great_expectations/tree/feature/GE-160/GE-237/alexsherstinsky/optimal_num_samples_bootstrapped_range_parameter_builder-2021_06_23-37/great_expectations/rule_based_profiler/estimators), which focuses on optimizing the standard error estimate of the bootstrap samples. The first task would be to determine whether it can be adapted to the programming style of scipy.stats.bootstrap (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html) and, if so, to open a pull request.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
mdhaber commented, Jun 29, 2021

One question is how the user will specify which quantity they are interested in (standard error, CI, or some new output like bias), as the number of bootstrap samples will depend on that. Originally, I suggested n_resamples=None in the sense of “no fixed number”, but that would not distinguish between SE and CI. We could allow a string like n_resamples='auto-ci', and I would probably prefer that over a separate argument, but I’m not sure what’s best.

I’ll suggest a way to tie this feature into the existing code shortly.
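To make the string option concrete, here is a hypothetical sketch of how such an argument could be interpreted. It does not describe an existing SciPy API: the helper name `parse_n_resamples` and the 'auto-se' value are assumptions added for illustration, and only 'auto-ci' comes from the comment above.

```python
def parse_n_resamples(n_resamples):
    """Interpret a hypothetical extended n_resamples argument.

    Returns (mode, count): mode is "fixed", "se", or "ci", and count is
    None whenever the number of resamples is to be chosen automatically.
    """
    if isinstance(n_resamples, str):
        prefix, _, target = n_resamples.partition("-")
        # 'auto-ci' is from the discussion; 'auto-se' is an assumed counterpart
        if prefix != "auto" or target not in {"se", "ci"}:
            raise ValueError(f"unrecognized n_resamples string: {n_resamples!r}")
        return target, None
    count = int(n_resamples)
    if count != n_resamples or count < 1:
        raise ValueError("n_resamples must be a positive integer")
    return "fixed", count


if __name__ == "__main__":
    print(parse_n_resamples("auto-ci"))
    print(parse_n_resamples(9999))
```

A single argument keeps the signature small, at the cost of mixing integers and strings in one parameter; a separate argument would keep the types distinct.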

1 reaction
alexsherstinsky commented, Jun 28, 2021

Thanks for the proposal @alexsherstinsky! This sounds like a good idea that looks simple to implement. Do you propose to do a PR?

@tupui Thank you for getting back to me – I would love to do a PR! I would also like to confirm with @mdhaber that I can proceed according to the plan I outlined in the feature request (since he has been helping me tremendously), or if I should alter the course of action in any way. Thank you very much.
