
ENH: random_state using scipy.stats.qmc engines


This is a follow-up to the discussions in #10844.

Describe the solution you’d like

All sampling methods in scipy.stats take a random_state argument, which is an np.random.Generator in new code. This NumPy generator, however, is not aware of dimensions, whereas the new scipy.stats.qmc module generates samples efficiently in n dimensions. It would be nice to bridge the gap between the two.

Currently there is a qmc.MultivariateNormalQMC. Instead of duplicating this for other distributions, a solution could be to make the QMC engines inherit from np.random.BitGenerator. We could then use the new QMC engines with all the existing distributions.
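To make the gap concrete, here is a minimal sketch of the three pieces mentioned above: a NumPy Generator, a QMC engine, and the existing qmc.MultivariateNormalQMC. The seed, the sample size, and the variable names are illustrative choices, not part of the proposal.

```python
import numpy as np
from scipy.stats import qmc

# NumPy Generator: the caller chooses the output shape; the generator
# itself has no notion of the dimension of the problem.
rng = np.random.default_rng(1234)
pseudo = rng.random((8, 2))

# QMC engine: the dimension d is fixed when the engine is created, and
# every draw returns points in the d-dimensional unit hypercube.
engine = qmc.Sobol(d=2, scramble=False)
quasi = engine.random(8)

# The one distribution that currently has a dedicated QMC sampler.
mvn = qmc.MultivariateNormalQMC(mean=[0, 0], cov=[[1, 0], [0, 1]])
normal = mvn.random(8)

print(pseudo.shape, quasi.shape, normal.shape)
```

All three arrays have the same shape, but only the last two come from a sampler that knows it is working in two dimensions.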

Describe alternatives you’ve considered

Solution using a BitGenerator: Seemingly, this must be done in Cython. Alternatively, there is a wrapper around BitGenerator, but it is not available in NumPy itself; it comes from randomgen by @bashtage: https://bashtage.github.io/randomgen/bit_generators/userbitgenerator.html. This works, but the underlying NumPy code is not aware of dimensions (https://github.com/numpy/numpy/blob/e4feb7027e397925d220a10dd58b581b87ca1fec/numpy/random/_generator.pyx#L3562-L3568).

from numpy.random import Generator
from randomgen import UserBitGenerator
from scipy.stats.qmc import Sobol


class SobolNP:
    def __init__(self, state):
        self._next_64 = None
        self.engine = Sobol(d=1, scramble=False, seed=state)

    def random_raw(self):
        """Generate the next "raw" value, which is 64 bits"""
        # engine.random() returns an array of shape (1, 1); take the scalar
        return int(self.engine.random()[0, 0] * 2**64)  # although Sobol uses 30 bits...

    @property
    def next_64(self):
        def _next_64(void_p):
            return self.random_raw()

        self._next_64 = _next_64
        return _next_64


# examples
prng = SobolNP(1234)
sobol_bit_generator = UserBitGenerator(prng.next_64, 64)
gen = Generator(sobol_bit_generator)
gen.random(8)
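The dimension problem can be illustrated without randomgen. The sketch below assumes that a generator fills an (n, 2) array row by row from a flat one-dimensional stream, which is emulated here with a plain reshape; the variable names and the sample size are illustrative.

```python
import numpy as np
from scipy.stats import qmc

n = 64  # number of 2-D points; a power of 2 suits Sobol' sequences

# One-dimensional engine, as in the SobolNP wrapper: a flat stream of values.
stream = qmc.Sobol(d=1, scramble=False).random(2 * n).ravel()
# Filling an (n, 2) array from that stream pairs up consecutive values.
paired = stream.reshape(n, 2)

# A genuinely two-dimensional engine, for comparison.
native = qmc.Sobol(d=2, scramble=False).random(n)

# In the unscrambled 1-D sequence, the values at positions 2i and 2i + 1
# differ by exactly 0.5, so every paired point lies on one of the two
# lines y = x + 0.5 or y = x - 0.5.
gaps = np.abs(paired[:, 0] - paired[:, 1])
print(np.unique(gaps))

# Centered discrepancy of each point set (lower means more uniform).
print("paired 1-D stream:", qmc.discrepancy(paired))
print("native 2-D engine:", qmc.discrepancy(native))
```

Points built by pairing a 1-D low-discrepancy stream collapse onto two lines instead of covering the unit square, which is why the engine itself needs to know the dimension.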

Another solution would be to mock np.random.Generator: I am using __getattr__ to intercept calls to the distributions, such as random_state.uniform(...), and route them to the matching scipy.stats distribution. This seems to work, and n dimensions are handled too.

from numpy.random import Generator
import scipy.stats as stats
from functools import partial


class ScipyGenerator:

    @property
    def __class__(self):
        return Generator

    def __init__(self, d):
        self.qrng = stats.qmc.Sobol(d=d, scramble=False)

    def rvs(self, *args, dist, **kwargs):
        args = list(args)
        size = kwargs.pop('size', None)
        if size is None:
            size = args.pop(0)
        sample = self.qrng.random(size)
        return dist(*args, **kwargs).ppf(sample)

    def __getattr__(self, attr):
        # Look up the distribution by name in scipy.stats; an unknown name
        # raises AttributeError, as a missing attribute normally would.
        dist = getattr(stats, attr)
        return partial(self.rvs, dist=dist)


# examples
random_state = ScipyGenerator(d=2)
isinstance(random_state, Generator)
random_state.uniform(8)
random_state.gamma(2, size=8)
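The core of this approach is inverse transform sampling: a QMC sample in the unit hypercube is pushed through the distribution's ppf. Below is a minimal sketch of that step on its own; qmc_rvs is a hypothetical helper name, not a SciPy function. It also shows one caveat of scramble=False: the first unscrambled Sobol' point is the origin, which ppf maps to the lower bound of the support.

```python
import numpy as np
from scipy import stats
from scipy.stats import qmc


def qmc_rvs(dist, n, d, scramble=True):
    """Hypothetical helper: draw n points in d dimensions from a frozen
    scipy.stats distribution by applying its ppf to a Sobol' sample."""
    engine = qmc.Sobol(d=d, scramble=scramble)
    u = engine.random(n)  # shape (n, d), values in [0, 1)
    return dist.ppf(u)    # applied elementwise


# Scrambled engine: the points avoid the origin, so the values are finite.
x = qmc_rvs(stats.norm(), n=8, d=2)

# Unscrambled engine: the first point is (0, 0), and norm.ppf(0) is -inf.
x_unscrambled = qmc_rvs(stats.norm(), n=8, d=2, scramble=False)

print(x.shape)
print(x_unscrambled[0])
```

For distributions whose support is bounded below, such as uniform or gamma, the unscrambled origin maps to a finite value, so the examples above do not hit this issue.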


Top GitHub Comments


mdhaber commented, Aug 28, 2021

It seems that we have two big ideas for stats maintenance these days:

1. Overhaul of the distribution superclasses
2. Overhaul of (almost) everything else (other than the major new enhancements: QMC and fast RVS)

After a year of pretty intense work on scipy.stats, I’ve become interested in both, but I’ve prioritized #2. The resampling method PRs (bootstrap, permutation test, and Monte Carlo tests) and gh-13312 have actually all been part of that effort. Besides applying the _axis_nan_policy decorator to existing functions and providing the resampling methods to expand the functionality of the statistics and tests we already have, my plan has been to create a class that makes writing a typical hypothesis test as simple as implementing a 1D statistic (and, if the null hypothesis is not that of independence, possibly defining the distribution of the statistic under the null hypothesis). All this other functionality (nan_policy, vectorization, one- and two-sided p-values, confidence intervals, etc.) can be inherited - or overridden if desired. The goal is for the interfaces and capabilities of most tests to be pretty consistent without requiring contributors to re-invent the wheel in every PR. rv_continuous and rv_discrete are not perfect, but think what it would be like if everyone had to implement distributions from scratch! A lot of distributions would be missing a lot of methods, there would be different names for the same methods and parameters, there would be varying behaviors in case of invalid inputs (maybe output NaN, maybe raise this or that error), and who knows what support for vectorization would look like. But this is exactly what we have in the case of the hypothesis tests and correlation functions now! I think that’s why I have prioritized it.

There has also been some effort toward #1. @tirthasheshpatel and I have been talking in recent PRs about overhauling the test suite of distributions so that we can find all the obvious bugs where distribution methods are not living up to their public signatures.

I don’t know that we should work on all aspects of these projects simultaneously. So, in the case of working on overhaul of rv_continuous/discrete, I think it would be better if this were to wait a bit. I really would like to be a part of it, but I don’t think I can do it in parallel with all the rest. Also, I think it would help to have a more thorough test suite to help us characterize the shortcomings of rv_continuous and rv_discrete before we rewrite the distribution classes (or factory functions). And personally, I’d prefer to get a little further along toward the other stuff - item 2 - before digging deeply into those test suite improvements.

So maybe I’d suggest this order of high-level maintenance operations:

1a. Get gh-14651 rolling. (It will take a long time to complete but once it gets rolling, each new PR will be pretty quick and easy, as has been the case with the alternative effort in gh-12506. I think we’ll know when we get there, and at that point, we can move on.)
1b. A base class (or factory function) for hypothesis tests, with the first example being the z-test (gh-13662).
2a. Overhaul the test suite of the distributions.
2b. Look into what comes after rv_continuous and rv_discrete.

I like some variety, but I’m not really efficient when I’m bouncing between dozens of PRs, waiting a few months at a time between updates and having to re-learn everything when I come back to it. I imagine that if a few of us were able to tag-team as authors and reviewers toward a common goal, we could get a lot done pretty quickly. What do you think?


tupui commented, Aug 28, 2021

This action plan sounds reasonable 👍 The overhaul of the distributions is a massive undertaking for sure, and we would need to have a few maintainers on board, first to avoid late discussions and second to do the hard work. This should not be a project for only 1-2 people.

As usual, feel free to ping me if you feel I could help 😃
