BUG: stats.binned_statistic_dd binnumber is not usable

Describe your issue.

When using binned_statistic_dd, one may want to get the indices or values of the data points that fall into certain bins, selected by the value of the computed statistic.

After computing the statistic, one could be tempted to use np.nonzero((hist >= x) & (hist < y)) paired with np.ravel_multi_index or another indexing function, expecting the returned flat indices to correspond to binnumber values. Accessing the data would then be as easy as data[bs.binnumber == index].

However, this is not the case. If you have (N, M) bins, binnumber appears to be counted as if there were 2 extra bins per data dimension, i.e. (N+2, M+2), despite the shape of statistic being (N, M).

So, in order to “fix” the mapping between binnumber and statistic indices, one has to unravel each binnumber against the larger (N+2, M+2) shape, shift every per-dimension index down by one, and re-ravel against the shape of statistic (or do the reverse, shifting up by one, for statistic indices).
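The remapping described above can also be done on the whole binnumber array at once. This is only a sketch: the seed, sample size and bin counts are arbitrary, and it assumes no data point falls outside the bin edges (true here because the edges are derived from the data), since an outlier would produce an index of -1 or N after the shift.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(0)          # arbitrary seed, for illustration only
n1, n2 = 4, 5                           # arbitrary bin counts
data = rng.normal(size=(200, 2))
res = binned_statistic_dd(data, np.zeros(len(data)), statistic="count", bins=(n1, n2))

# Unravel every binnumber against the padded (n1 + 2, n2 + 2) shape ...
i1, i2 = np.unravel_index(res.binnumber, (n1 + 2, n2 + 2))
# ... shift each per-dimension index down by one and re-ravel against statistic.shape.
# Assumption: no outliers, otherwise ravel_multi_index raises on the invalid index.
flat = np.ravel_multi_index((i1 - 1, i2 - 1), res.statistic.shape)

# Sanity check: counting the remapped bin numbers reproduces the histogram.
counts = np.bincount(flat, minlength=n1 * n2).reshape(n1, n2)
```

With this mapping, data[flat == np.argmax(res.statistic)] selects the points in the most populated bin.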

I guess this is needed internally when one wants to reuse previous results; however, I think it would be better for binned_statistic_dd to encode/decode binnumbers so that they are consistent with the “user’s side” data shapes.
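For comparison, binned_statistic_dd has a documented expand_binnumbers option that returns one row of bin indices per dimension instead of a single linearized number. The per-dimension indices still appear to carry the same offset of one, so the sketch below subtracts 1 before comparing with statistic indices. The seed and bin counts are arbitrary, and the data is again assumed to contain no outliers.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(1)          # arbitrary seed, for illustration only
data = rng.normal(size=(300, 2))
res = binned_statistic_dd(
    data, np.zeros(len(data)), statistic="count", bins=(6, 7),
    expand_binnumbers=True,             # binnumber has shape (D, N) instead of (N,)
)
hist = res.statistic

# Per-dimension bin numbers start at 1 for in-range points; shift to 0-based.
idx = res.binnumber - 1

# Select the points that fall in the most populated bin.
peak = np.unravel_index(np.argmax(hist), hist.shape)
in_peak = (idx[0] == peak[0]) & (idx[1] == peak[1])
peak_points = data[in_peak]
```

This avoids the ravel/unravel round trip, at the cost of comparing one index per dimension.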

Reproducing Code Example

import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(42)
n1, n2 = 10, 10
data = rng.normal(size=(100, 2))
stat = binned_statistic_dd(data, np.zeros(len(data)), statistic="count", bins=(n1, n2))
hist = stat.statistic
binnr = stat.binnumber

# Naive lookup: compare binnumber with a flat index into hist.
try:
    assert len(data[binnr == np.argmax(hist)]) == hist.max(), "this should not fail"
except AssertionError as e:
    print(e)

# Workaround 1: unravel each binnumber against the padded shape,
# shift by one, and re-ravel against hist.shape.
binnr_fix = []
for bn in binnr:
    i1, i2 = np.unravel_index(bn, (n1 + 2, n2 + 2))
    binnr_fix.append(np.ravel_multi_index((i1 - 1, i2 - 1), hist.shape))
binnr_fix = np.array(binnr_fix)  # a plain list would compare to the index as a single False

assert len(data[binnr_fix == np.argmax(hist)]) == np.max(hist)

# Workaround 2: map the hist index of the (first) maximum into the padded numbering.
max_i, max_j = np.unravel_index(np.argmax(hist), hist.shape)
assert len(data[binnr == np.ravel_multi_index((max_i + 1, max_j + 1), (n1 + 2, n2 + 2))]) == np.max(hist)



### Error message

```shell
-
```

SciPy/NumPy/Python version information

1.7.3 1.21.5 sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)

Issue Analytics

State:
Created a year ago
Comments:6 (4 by maintainers)

Top GitHub Comments

mdhaber commented, May 19, 2022

> if the two extra bins are never returned in statistic

I see. Yeah, that’s surprising to me.

rebelot commented, May 18, 2022

Thanks for your reply. My question is: what is the point of numbering the bins this way if the two extra bins are never returned in statistic? If a combination of edges or range makes some data fall outside the bin edges, those points should not be assigned to any bin, and I think it's reasonable to assign bin number nan to outlier data.
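To make the outlier case concrete, here is a small sketch with an explicit range that deliberately excludes two of four hand-picked points. It assumes, as observed in this issue, that out-of-range points are assigned the extra bin numbers 0 or n + 1 in the affected dimension. The data values and the names nbins and inside are made up for illustration.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

# Four hand-picked points; the first and third lie outside the unit square.
data = np.array([[-5.0, 0.5], [0.5, 0.5], [0.5, 9.0], [0.25, 0.75]])
nbins = np.array([2, 2])
res = binned_statistic_dd(
    data, np.zeros(len(data)), statistic="count", bins=tuple(nbins),
    range=[(0, 1), (0, 1)], expand_binnumbers=True,
)

# In-range points have per-dimension bin numbers 1..n; the extra numbers
# 0 and n + 1 mark points below or above the range in that dimension.
b = res.binnumber                                   # shape (2, 4)
inside = np.all((b >= 1) & (b <= nbins[:, None]), axis=0)

# Only the in-range points contribute to the returned statistic.
n_counted = int(res.statistic.sum())
```

Masking with inside before shifting the indices by one gives the effect asked for above: outliers are dropped rather than mapped to a bin of statistic.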