BUG: stats.binned_statistic_dd binnumber is not usable
### Describe your issue.
When using `binned_statistic_dd`, one could be interested in getting the indices/values of the data points that fall in certain bins, depending on the value of the computed statistic. After computing the statistic, one could be tempted to use `np.nonzero((hist >= x) & (hist < y))` paired with `np.ravel_multi_index` or other indexer functions, and think that the returned indices should correspond, in principle, to binnumbers. Accessing your data would then be as easy as `data[bs.binnumber == indices]`.
However, this is not the case. If you have `(N, M)` bins, it looks like binnumbers are counted as if there were 2 extra bins per data dimension, i.e. `(N+2, M+2)`, despite the shape of `statistic` being `(N, M)`.
So, in order to "fix" the mapping between binnumber and statistic indices, one has to re-ravel the binnumbers (or their indices) as if they came from a larger array, shifted by one element along each dimension.
I guess this is needed internally when one wants to reuse previous results. However, I think it would be better for `binned_statistic_dd` to encode/decode binnumbers so that they are consistent with the "user's side" data shapes.
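The re-raveling described above can also be written in vectorized form. This is a minimal sketch, assuming C-order raveling over the padded `(N+2, M+2)` grid as observed in this issue; `to_statistic_index` is a hypothetical helper, not part of SciPy, and the choice of `-1` for points outside the regular bins is only an illustration.

```python
import numpy as np
from scipy.stats import binned_statistic_dd


def to_statistic_index(binnumber, stat_shape):
    """Hypothetical helper (not a SciPy function): map padded flat
    binnumbers to flat indices into an array of shape ``stat_shape``.
    Points that fall in a padding bin get -1."""
    stat_shape = tuple(int(n) for n in stat_shape)
    # Assumption: binnumbers are raveled over a grid with one extra bin
    # on each side of every dimension.
    padded_shape = tuple(n + 2 for n in stat_shape)
    padded = np.array(np.unravel_index(binnumber, padded_shape))
    inner = padded - 1  # shift so the first regular bin has index 0
    limits = np.array(stat_shape)[:, None]
    inside = np.all((inner >= 0) & (inner < limits), axis=0)
    flat = np.full(binnumber.shape, -1, dtype=np.intp)
    flat[inside] = np.ravel_multi_index(tuple(inner[:, inside]), stat_shape)
    return flat


rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
res = binned_statistic_dd(data, np.zeros(len(data)), statistic="count", bins=(10, 10))
hist = res.statistic

flat = to_statistic_index(res.binnumber, hist.shape)
# Select the points that fall in the most populated bin.
selected = data[flat == np.argmax(hist)]
```

With this mapping, `flat` can be compared directly against the output of `np.argmax(hist)` or `np.ravel_multi_index(np.nonzero(...), hist.shape)`.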
### Reproducing Code Example
```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(42)
n1, n2 = 10, 10
data = rng.normal(size=(100, 2))
stat = binned_statistic_dd(data, np.zeros(len(data)), statistic="count", bins=(n1, n2))
hist = stat.statistic
binnr = stat.binnumber

# Naive approach: compare binnumber with a flat index into hist.
try:
    assert len(data[binnr == np.argmax(hist)]) == hist.max(), "this should not fail"
except AssertionError as e:
    print(e)

# Workaround 1: re-ravel each binnumber from the padded (n1 + 2, n2 + 2)
# grid into the (n1, n2) shape of hist.
binnr_fix = []
for bn in binnr:
    i1, i2 = np.unravel_index(bn, (n1 + 2, n2 + 2))
    binnr_fix.append(np.ravel_multi_index((i1 - 1, i2 - 1), hist.shape))
binnr_fix = np.array(binnr_fix)  # a plain list would not compare elementwise
assert len(data[binnr_fix == np.argmax(hist)]) == np.max(hist)

# Workaround 2: map the hist indices into the padded grid instead.
# This comparison assumes the maximum of hist is unique.
max_ij = np.nonzero(hist == hist.max())
assert len(data[binnr == np.ravel_multi_index((max_ij[0] + 1, max_ij[1] + 1), (n1 + 2, n2 + 2))]) == np.max(hist)
```
### Error message
```shell
-
```
### SciPy/NumPy/Python version information
1.7.3 1.21.5 sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)
Issue Analytics
- State:
- Created a year ago
- Comments: 6 (4 by maintainers)
I see. Yeah, that’s surprising to me.
Thanks for your reply. My question is: what is the point of having the bins numbered this way if the two extra bins are never returned in `statistic`? If a combination of edges or `range` makes some data fall outside the bin edges, those points should not be assigned to any bin, and I think it's reasonable to assign bin number `nan` to outlier data.
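To make the outlier case concrete, here is a small sketch using the existing `expand_binnumbers=True` option, which returns one index per dimension instead of a single raveled number. The bin counts and `range` below are arbitrary values chosen for illustration. Since integer arrays cannot hold `nan`, the sketch flags outliers with a boolean mask instead.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
n1, n2 = 4, 5

# A range narrower than the data, so that some points are outliers.
res = binned_statistic_dd(
    data, np.zeros(len(data)), statistic="count",
    bins=(n1, n2), range=[(-1, 1), (-1, 1)],
    expand_binnumbers=True,
)
hist = res.statistic

# With expand_binnumbers=True, binnumber has shape (D, N) and holds one
# 1-based index per dimension; 0 and n + 1 mark the out-of-range bins.
idx = res.binnumber - 1
limits = np.array(hist.shape)[:, None]
outlier = np.any((idx < 0) | (idx >= limits), axis=0)

# Rebuild the counts from the in-range points only.
counts = np.zeros(hist.shape, dtype=int)
np.add.at(counts, tuple(idx[:, ~outlier]), 1)
```

Here `counts` reproduces `hist`, and the points flagged by `outlier` are exactly those that landed in one of the two extra bins of some dimension.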