ENH: MemoryError when trying to bootstrap with large amounts of data
Please describe your issue.
Trying to bootstrap "large" amounts of data will throw a MemoryError, even without vectorization. Generating the drawn samples one iteration at a time would allow for larger data arrays without causing memory errors.
Expected result
Setting vectorized=False should reduce the memory requirement, since we no longer need to create a matrix of size len(data) * n_resamples. Drawing a single sample at a time should be enough when we don't want to parallelize anything.
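As a rough illustration of the idea (a sketch only, not SciPy's actual implementation; the sample data, the name theta_hat_b, and the percentile interval at the end are illustrative choices), a plain NumPy loop that draws one resample per iteration only ever holds a single length-n index array:

import numpy as np

rng = np.random.default_rng()
data = rng.standard_normal(1_000_000)   # any large 1-D sample
n = len(data)
n_resamples = 9999

# Peak memory is O(n) rather than O(n * n_resamples): only one resample's
# indices exist at any moment.
theta_hat_b = np.empty(n_resamples)
for k in range(n_resamples):
    i = rng.integers(0, n, size=n)      # indices for a single resample
    theta_hat_b[k] = np.var(data[i])    # statistic evaluated on that resample

# basic percentile confidence interval from the bootstrap distribution
ci_low, ci_high = np.percentile(theta_hat_b, [2.5, 97.5])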
Actual result
A MemoryError is raised for even moderately large arrays.
Steps to reproduce
import scipy
import numpy as np
print(scipy.__version__)
Output:
1.7.1
data = np.random.randint(low=0, high=1, size=1_000_000)
scipy.stats.bootstrap(data=(data,), statistic=np.var, vectorized=False)
Output:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
/tmp/ipykernel_21666/4129127229.py in <module>
1 data = np.random.randint(low=0, high=1, size=1_000_000)
----> 2 scipy.stats.bootstrap(data=(data,) , statistic=np.var, vectorized=False)
~/.local/lib/python3.8/site-packages/scipy/stats/_bootstrap.py in bootstrap(data, statistic, vectorized, paired, axis, confidence_level, n_resamples, batch, method, random_state)
431 resampled_data = []
432 for sample in data:
--> 433 resample = _bootstrap_resample(sample, n_resamples=batch_actual,
434 random_state=random_state)
435 resampled_data.append(resample)
~/.local/lib/python3.8/site-packages/scipy/stats/_bootstrap.py in _bootstrap_resample(sample, n_resamples, random_state)
50
51 # bootstrap - each row is a random resample of original observations
---> 52 i = rng_integers(random_state, 0, n, (n_resamples, n))
53
54 resamples = sample[..., i]
~/.local/lib/python3.8/site-packages/scipy/_lib/_util.py in rng_integers(gen, low, high, size, dtype, endpoint)
537
538 # exclusive
--> 539 return gen.randint(low, high=high, size=size, dtype=dtype)
540
541
mtrand.pyx in numpy.random.mtrand.RandomState.randint()
_bounded_integers.pyx in numpy.random._bounded_integers._rand_int64()
MemoryError: Unable to allocate 74.5 GiB for an array with shape (9999, 1000000) and data type int64
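For reference, the 74.5 GiB in the error message is exactly the size of that single index matrix:

9999 resamples × 1,000,000 observations × 8 bytes (int64) ≈ 8.0 × 10^10 bytes ≈ 74.5 GiB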
The issue stems from this line:
---> 52 i = rng_integers(random_state, 0, n, (n_resamples, n))
The _bootstrap_resample method (and the method that calls it) needs to know that generating all n_resamples samples at once isn't necessary when vectorized=False.
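Until that changes, the batch parameter that already appears in the signature in the traceback above caps how many resamples are drawn per call, which sidesteps the giant allocation. A workaround sketch (batch=100 and method='percentile' are arbitrary choices here; 'percentile' is used only to skip the even more expensive jackknife step that the default BCa method would add):

import numpy as np
import scipy.stats

rng = np.random.default_rng()
data = rng.standard_normal(1_000_000)   # a non-degenerate sample, for illustration

# batch=100 keeps the index array at shape (100, 1_000_000) int64, about 0.8 GB,
# instead of (9999, 1_000_000), about 74.5 GiB.
res = scipy.stats.bootstrap((data,), np.var, vectorized=False,
                            batch=100, method='percentile')
print(res.confidence_interval)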
Comments
See mdhaber/scipy#63 for what I meant about using a GPU. I'm seeing ~10x speedup on this. (Update: 50x using rand and casting instead of randint.) I'd need to take a more careful look to see where the bottleneck is, as I was expecting more, but it's definitely useful.

Update: generating the random numbers with xp.rand and multiplying is much faster, bringing the total execution time from 240 ms to 53 ms; cupy.scipy.special.ndtr is used as well. The slow part is again generating the resamples - specifically a reshape operation in _jackknife_resample. I could probably find a more efficient way to generate the jackknife resamples on the GPU.

Then yeah, if you can use a GPU, it looks really easy to use CuPy as a back end for the expensive calculations - at least the part that computes the bootstrap distribution (and probably the jackknife part of BCa, if desired).
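A minimal sketch of the "rand and cast" trick mentioned above, written with NumPy so it runs anywhere (on a GPU, xp would be CuPy); the shapes and the int64 cast are assumptions for illustration, not the actual code from mdhaber/scipy#63:

import numpy as np

xp = np   # with CuPy installed this could instead be: import cupy as xp

n = 1_000_000
n_resamples = 100   # a small batch, for illustration

# randint-style index generation (what _bootstrap_resample does today)
i_int = xp.random.randint(0, n, size=(n_resamples, n))

# rand-and-cast variant: uniform floats in [0, 1), scaled by n and truncated,
# give the same kind of resample indices; reported ~50x faster on the GPU
i_cast = (xp.random.rand(n_resamples, n) * n).astype(xp.int64)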
We are beginning to think about supporting this sort of thing in SciPy. It may be as simple as adding a backend or xp parameter and replacing occurrences of np with that. But I've waited to open a PR until we decide on the right way to do things, as we'll want this sort of option in a lot of places.

Thanks! It's always good to know what this stuff is being used for.
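As an entirely hypothetical sketch of what such an xp parameter could look like (not an actual SciPy API; the function name and signature are made up for illustration):

import numpy as np

def bootstrap_distribution(sample, statistic, n_resamples=9999, xp=np):
    # Pass xp=cupy (and a CuPy array) to run the same code on the GPU.
    n = sample.shape[-1]
    i = (xp.random.rand(n_resamples, n) * n).astype(xp.int64)  # resample indices
    return statistic(sample[..., i], axis=-1)                  # one value per resample

# CPU example:
sample = np.random.default_rng().standard_normal(1_000)
theta_hat_b = bootstrap_distribution(sample, np.var)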