BUG: stats.binned_statistic_dd binnumber is not usable


Describe your issue.

When using binned_statistic_dd, one may want the indices/values of the data points that fall in certain bins, depending on the value of the computed statistic.

After computing the statistic, one could be tempted to use np.nonzero((hist >= x) & (hist < y)) paired with np.ravel_multi_index or other indexer functions and think that the returned indices should correspond, in principle, to binnumbers. Then, accessing your data is as easy as data[bs.binnumber == indices].

However, this is not the case. If you have (N, M) bins, it looks like binnumbers are counted as if there were 2 extra bins per data dimension, i.e. (N+2, M+2), despite the shape of statistic being (N, M).
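The padding is easy to see on a hand-checkable case. The sketch below is illustrative only (the four points and the 2x2 bin edges are made up for the example): each point sits in a different bin, yet the returned binnumber values are flat indices on a 4x4 grid rather than on the 2x2 statistic.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

# Four made-up points, one per bin of a 2x2 grid over [0, 2] x [0, 2].
points = np.array([[0.5, 0.5], [0.5, 1.5], [1.5, 0.5], [1.5, 1.5]])
edges = [np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 2.0])]

res = binned_statistic_dd(points, np.zeros(len(points)), statistic="count", bins=edges)

# Flat indices one might expect, i.e. positions in the (2, 2) statistic.
naive = np.arange(res.statistic.size)
print(res.statistic.shape)  # (2, 2)
print(naive.tolist())       # [0, 1, 2, 3]

# Actual bin numbers: 5, 6, 9, 10, i.e. flat indices on the padded (4, 4) grid.
print(res.binnumber.tolist())

# The same numbers follow from per-dimension indices 1..2 raveled on (4, 4).
padded = np.ravel_multi_index(([1, 1, 2, 2], [1, 2, 1, 2]), (4, 4))
print(padded.tolist())
```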

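So, in order to “fix” the mapping between binnumber and statistic indices, one has to unravel the binnumbers against the larger (N+2, M+2) shape, shift every index down by one, and re-ravel against the shape of statistic. Going the other way, the statistic indices are shifted up by one and raveled against the larger shape.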
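The re-raveling described above can be done without a Python loop. The following is a minimal sketch, not part of SciPy: `binnumber_to_statistic_index` is a hypothetical helper name, and it assumes every point falls inside the bin edges (which holds when the bins are given as integers and no range is passed). With out-of-range points, `np.ravel_multi_index` would raise a `ValueError`.

```python
import numpy as np
from scipy.stats import binned_statistic_dd


def binnumber_to_statistic_index(binnumber, statistic_shape):
    """Hypothetical helper: map padded flat bin numbers to flat indices into statistic.

    Assumes no point lies outside the bin edges.
    """
    padded_shape = tuple(n + 2 for n in statistic_shape)
    # Per-dimension indices on the padded grid (1..n for in-range points).
    padded_idx = np.unravel_index(binnumber, padded_shape)
    # Shift down by one to get zero-based indices into statistic.
    inner_idx = tuple(idx - 1 for idx in padded_idx)
    return np.ravel_multi_index(inner_idx, statistic_shape)


rng = np.random.default_rng(42)
data = rng.normal(size=(100, 2))
stat = binned_statistic_dd(data, np.zeros(len(data)), statistic="count", bins=(10, 10))
hist = stat.statistic

flat_idx = binnumber_to_statistic_index(stat.binnumber, hist.shape)

# Points that landed in the fullest bin.
in_fullest_bin = data[flat_idx == np.argmax(hist)]

# Sanity check: counting the remapped indices reproduces the count statistic.
recount = np.bincount(flat_idx, minlength=hist.size).reshape(hist.shape)
```

Here `recount` should equal `hist`, and `in_fullest_bin` should contain `hist.max()` rows.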

I guess this is needed internally when one wants to reuse previous results; however, I think it would be better for binned_statistic_dd to encode/decode binnumbers so that they are consistent with the “user’s side” data shapes.
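Until such an encoding exists, one possible workaround is the `expand_binnumbers=True` option of binned_statistic_dd, which returns per-dimension bin indices instead of a single flat number. Those indices are still one-based because of the padding, so the sketch below subtracts one before indexing statistic. The thresholds `lo` and `hi` are arbitrary values chosen for illustration, and the sketch again assumes no point lies outside the bin edges.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(42)
data = rng.normal(size=(100, 2))

stat = binned_statistic_dd(
    data, np.zeros(len(data)), statistic="count", bins=(10, 10),
    expand_binnumbers=True,
)
hist = stat.statistic

# binnumber has shape (2, N) here; subtracting 1 removes the padding offset.
idx = stat.binnumber - 1

# Value of the statistic in the bin each point belongs to.
per_point_stat = hist[tuple(idx)]

# Select the points whose bin count lies in [lo, hi) (arbitrary thresholds).
lo, hi = 2, 5
mask = (per_point_stat >= lo) & (per_point_stat < hi)
selected = data[mask]
```

The number of selected points should match the total count of the bins that satisfy the same condition, i.e. `hist[(hist >= lo) & (hist < hi)].sum()`.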

Reproducing Code Example

import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(42)
n1, n2 = 10, 10
data = rng.normal(size=(100, 2))
stat = binned_statistic_dd(data, np.zeros(len(data)), statistic="count", bins=(n1, n2))
hist = stat.statistic
binnr = stat.binnumber

# Naive expectation: binnumber matches the flat index into hist. It does not.
try:
    assert len(data[binnr == np.argmax(hist)]) == hist.max(), "this should not fail"
except AssertionError as e:
    print(e)

# Workaround 1: unravel on the padded (n1 + 2, n2 + 2) grid, shift by one, re-ravel on hist.shape.
binnr_fix = []
for bn in binnr:
    i1, i2 = np.unravel_index(bn, (n1 + 2, n2 + 2))
    binnr_fix.append(np.ravel_multi_index((i1 - 1, i2 - 1), hist.shape))
binnr_fix = np.array(binnr_fix)  # a plain list would not compare elementwise

assert len(data[binnr_fix == np.argmax(hist)]) == np.max(hist)

# Workaround 2: shift the hist index of the (first) maximum onto the padded grid instead.
max_i, max_j = np.unravel_index(np.argmax(hist), hist.shape)
assert len(data[binnr == np.ravel_multi_index((max_i + 1, max_j + 1), (n1 + 2, n2 + 2))]) == np.max(hist)



Error message

-

SciPy/NumPy/Python version information

1.7.3 1.21.5 sys.version_info(major=3, minor=9, micro=12, releaselevel='final', serial=0)

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
mdhaber commented, May 19, 2022

if the two extra bins are never returned in statistic

I see. Yeah, that’s surprising to me.

0 reactions
rebelot commented, May 18, 2022

Thanks for your reply. My question is: what is the point of having the bins numbered this way if the two extra bins are never returned in statistic? If a combination of edges or range makes some data fall outside the bin edges, they should not be assigned to any bin, and I think it’s reasonable to assign bin number nan to outlier data.
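The outlier case can be handled on the user side in the meantime. The sketch below is one possible approach under stated assumptions, not SciPy behavior: it passes an explicit range so that some points fall outside the edges, and marks those points with a sentinel of -1, since an integer array cannot hold nan. The sample size, the range, and the 4x4 grid are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import binned_statistic_dd

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))
shape = (4, 4)

# A range narrower than the data, so some points are out of range.
stat = binned_statistic_dd(
    data, np.zeros(len(data)), statistic="count",
    bins=shape, range=[(-1, 1), (-1, 1)], expand_binnumbers=True,
)

# Zero-based per-dimension indices; -1 or n marks a point outside the edges.
idx = stat.binnumber - 1
upper = np.array(shape)[:, None]
inside = np.all((idx >= 0) & (idx < upper), axis=0)

# Flat indices into statistic, with -1 as the "no bin" sentinel for outliers.
flat = np.full(data.shape[0], -1)
flat[inside] = np.ravel_multi_index(tuple(idx[:, inside]), shape)

# Counting only the in-range points reproduces the returned statistic.
recount = np.bincount(flat[inside], minlength=stat.statistic.size).reshape(shape)
```

With this encoding, `data[flat == k]` selects the points of the bin at flat position `k` of statistic, and `data[flat == -1]` selects the outliers.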
