
ENH: MemoryError when trying to bootstrap with large amounts of data

See original GitHub issue

Please describe your issue.

Bootstrapping “large” amounts of data throws a MemoryError, even without vectorization. Generating the resamples one iteration at a time would allow much larger data arrays without exhausting memory.

Expected result

Setting vectorized=False should reduce the memory requirement, since we no longer need to create a matrix of size len(data) * n_resamples. Drawing a single sample at a time should be enough when we don’t want to parallelize anything.
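A minimal sketch of what one-at-a-time resampling could look like (hypothetical, not SciPy’s implementation; bootstrap_loop is made up for illustration):

import numpy as np

def bootstrap_loop(data, statistic, n_resamples=9999, seed=None):
    # Peak memory stays O(len(data)) instead of O(len(data) * n_resamples):
    # each resample is drawn, reduced to a scalar, and discarded.
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.empty(n_resamples)
    for k in range(n_resamples):
        i = rng.integers(0, n, size=n)   # indices for one resample
        stats[k] = statistic(data[i])
    return stats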

Actual result

A MemoryError is raised for even moderately large arrays.

Steps to reproduce

import scipy
import numpy as np
print(scipy.__version__)

Output:

1.7.1

data = np.random.randint(low=0, high=1, size=1_000_000)  # note: high is exclusive, so this data is all zeros
scipy.stats.bootstrap(data=(data,), statistic=np.var, vectorized=False)

Output:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
/tmp/ipykernel_21666/4129127229.py in <module>
      1 data = np.random.randint(low=0, high=1, size=1_000_000)
----> 2 scipy.stats.bootstrap(data=(data,) , statistic=np.var, vectorized=False)

~/.local/lib/python3.8/site-packages/scipy/stats/_bootstrap.py in bootstrap(data, statistic, vectorized, paired, axis, confidence_level, n_resamples, batch, method, random_state)
    431         resampled_data = []
    432         for sample in data:
--> 433             resample = _bootstrap_resample(sample, n_resamples=batch_actual,
    434                                            random_state=random_state)
    435             resampled_data.append(resample)

~/.local/lib/python3.8/site-packages/scipy/stats/_bootstrap.py in _bootstrap_resample(sample, n_resamples, random_state)
     50 
     51     # bootstrap - each row is a random resample of original observations
---> 52     i = rng_integers(random_state, 0, n, (n_resamples, n))
     53 
     54     resamples = sample[..., i]

~/.local/lib/python3.8/site-packages/scipy/_lib/_util.py in rng_integers(gen, low, high, size, dtype, endpoint)
    537 
    538         # exclusive
--> 539         return gen.randint(low, high=high, size=size, dtype=dtype)
    540 
    541 

mtrand.pyx in numpy.random.mtrand.RandomState.randint()

_bounded_integers.pyx in numpy.random._bounded_integers._rand_int64()

MemoryError: Unable to allocate 74.5 GiB for an array with shape (9999, 1000000) and data type int64
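The reported size checks out: the index matrix is n_resamples rows by len(data) columns of 8-byte integers:

n_resamples, n = 9999, 1_000_000
print(n_resamples * n * 8 / 2**30)   # ≈ 74.5 (GiB)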

The issue stems from this line:

---> 52     i = rng_integers(random_state, 0, n, (n_resamples, n))

The _bootstrap_resample (and calling) method should avoid generating all n_resamples resamples at once when vectorized=False; drawing them in smaller chunks (or one at a time) would keep memory bounded.
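Until that changes, the batch parameter visible in the signature above looks like a workaround: it should cap how many resamples are drawn per pass, so memory scales with batch * len(data) rather than n_resamples * len(data). A sketch, untested here:

import numpy as np
from scipy import stats

data = np.random.randint(low=0, high=1, size=1_000_000)
# batch=1 keeps only one resample (~8 MB of int64 indices) in memory
# at a time, instead of all 9999 at once (~74.5 GiB).
res = stats.bootstrap((data,), np.var, vectorized=False, batch=1)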

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
mdhaber commented, Dec 6, 2021

See mdhaber/scipy#63 for what I meant about using a GPU. I’m seeing ~10x speedup on this. (Update: 50x using rand and casting instead of randint)

import numpy as np
import cupy as cp
from scipy import stats

data_np = np.random.rand(10000)
rng_np = np.random.RandomState(0)
res_np = stats.bootstrap((data_np,), np.std, batch=1000,
                         random_state=rng_np, xp=np)  # 2.8 s ± 9.26 ms per loop 

data_cp = cp.array(data_np)
rng_cp = cp.random.RandomState(0)
res_cp = stats.bootstrap((data_cp,), cp.std, batch=1000, 
                         random_state=rng_cp, xp=cp)  # 240 ms ± 518 µs per loop  

I’d need to take a more careful look to see where the bottleneck is, as I was expecting more, but it’s definitely useful.

Update:

  • Computing the statistic is ~90x faster, which is more in line with what I was expecting (~1% of the total time now)
  • Generating the resamples is only ~4x faster and is now the bottleneck (~80% of the total time). Looking deeper, 97.2% of that time is the random number generation; only 2.8% is the indexing. This is a known issue: cupy/cupy#4120. Generating uniform floats with xp.rand and multiplying/casting instead is much faster, bringing the total execution time from 240 ms to 53 ms (see the sketch after this list).
  • In this case, calculating the BCa interval is ~10x faster, but it still accounts for ~18% of the total time. Again, actually computing the statistic is very fast. There is some overhead in using the GPU for the tiny calculations, like calculating the statistic for the observed data (only) and the use of cupy.scipy.special.ndtr. The slow part is again generating the resamples - specifically a reshape operation in _jackknife_resample. I could probably find a more efficient way to generate the jackknife resamples on the GPU.
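A minimal sketch of the rand-and-cast trick mentioned above, assuming CuPy (rand_integers is a made-up helper; xp can be numpy or cupy):

import cupy as cp

def rand_integers(xp, n, size):
    # Uniform floats in [0, 1), scaled to [0, n) and truncated to ints;
    # on CuPy this sidesteps the slow bounded-integer generator
    # (cupy/cupy#4120). Slightly biased for very large n due to float
    # precision, but fine for bootstrap-sized n.
    return (xp.random.rand(*size) * n).astype(xp.int64)

i = rand_integers(cp, 10_000, (1000, 10_000))  # one batch of resample indices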
0 reactions
mdhaber commented, Aug 30, 2021

Yes, we’re using regular vectorized numpy functions for the actual use case

Then yeah, if you can use a GPU, it looks really easy to use CuPy as a back end for the expensive calculations - at least the part that computes the bootstrap distribution (and probably the jackknife part of BCa, if desired).

We are beginning to think about supporting this sort of thing in SciPy. It may be as simple as adding a backend or xp parameter and replacing occurrences of np with that. But I’ve waited to open a PR until we decide on the right way to do things, as we’ll want this sort of option in a lot of places.
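For illustration, that pattern might look roughly like this (a sketch assuming a hypothetical xp keyword, not a merged SciPy API):

import numpy as np

def bootstrap_distribution(data, statistic, n_resamples=999, xp=np):
    # Every array operation goes through the xp namespace, so passing
    # xp=cupy runs the same code on the GPU.
    n = data.shape[-1]
    i = xp.random.randint(0, n, size=(n_resamples, n))
    return statistic(data[..., i], axis=-1)

dist = bootstrap_distribution(np.random.rand(100), np.std)  # xp=cupy for GPU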

We’ve got one KPI in particular…

Thanks! It’s always good to know what this stuff is being used for.


