digitize is not the inverse of histogram (right limit treated differently)

I think digitize should change how it bins a value that equals the right edge of the rightmost bin. Such a value should go into the rightmost bin instead of being classified as “beyond the bounds”. This would make the output of histogram compatible with digitize (currently it is not).

Example (short explanation)

>>> from numpy import histogram
>>> from numpy import digitize
>>> data = [1, 2, 3, 4, 5, 6, 7]
>>> histvals, binedges = histogram(data, 3)
>>> binedges
array([ 1.,  3.,  5.,  7.])
>>> histvals
array([2, 2, 3])

# Up to here, a histogram has been created characterized by 
# four bin edges (three bins). The rightmost bin contains 
# three data values, as its right edge is considered to belong
# to the bin.

# Now assign the data values to given bins 
# (represented by `binedges`) via `digitize`
>>> digitize(data, binedges)
array([1, 1, 2, 2, 3, 3, 4])

# The first two data values have been assigned to bin 1,
# the next two data values were assigned to bin 2,
# the next two data values were assigned to bin 3,
# the remaining data value has been declared as "out of bounds".
# This is where `histogram` and `digitize` behave differently by
# default.

# Reproduce `histogram` binning by manually shifting the 
# rightmost bin edge by an epsilon value:
>>> binedges[-1] += 10**-6
>>> digitize(data, binedges)
array([1, 1, 2, 2, 3, 3, 3])

Longish explanation

I am assuming that digitize is generally considered to be the inverse operation of histogram. Whereas histogram creates the bins (the bin edges) and counts the values falling into each bin, digitize takes the data values and the bin edges and (quoting the digitize docs) “Returns the indices of the bins to which each value in input array belongs.” This “reverse indexing” behavior has also been referred to in this issue: https://github.com/numpy/numpy/issues/990. It seems like other numerical frameworks provide this functionality, too.

The histogram specs (http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html) clarify the meaning of the bin edges. By default, each bin contains its left edge and does not contain its right edge, except for the rightmost bin, whose right edge belongs to the bin (“The last bin, however, is [3, 4], which includes 4.”)
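As a small illustration of the closed rightmost bin, the following sketch uses the edges `[1, 2, 3, 4]` from the quoted documentation example. The input values are chosen here for illustration and are not part of the original report.

```python
import numpy as np

# Edges from the documentation example: bins are [1, 2), [2, 3), [3, 4].
edges = [1, 2, 3, 4]

# Illustrative data: one value sitting exactly on each edge.
values = [1, 2, 3, 4]

counts, _ = np.histogram(values, bins=edges)

# The value 4 lies on the rightmost edge and is counted in the last bin,
# so the last bin holds two values (3 and 4).
print(counts)  # [1 1 2]
```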

In contrast, the digitize specs state that “Each index i returned is such that bins[i-1] <= x < bins[i]”. There is no comment on the rightmost boundary, so this general statement also applies to it.
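The half-open rule can be checked directly. This is a minimal sketch with illustrative values (not taken from the report), using the default `right=False`:

```python
import numpy as np

bins = np.array([1.0, 3.0, 5.0, 7.0])

# Illustrative values placed on and just below the bin edges.
x = np.array([1.0, 2.9, 3.0, 6.9, 7.0])

idx = np.digitize(x, bins)

# Every in-range index i satisfies bins[i-1] <= x < bins[i].
# The value 7.0 equals bins[-1], so no such i exists inside the edges
# and digitize returns len(bins), the "beyond the bounds" index.
print(idx)  # [1 1 2 3 4]
```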

In conclusion, whereas histogram, when it computes the bin edges itself, always places the rightmost edge at the maximum data value and counts that value in the last bin, digitize does not treat this value as part of the histogram represented by the bin edges array.

Restoring compatibility between both functions requires manually incrementing the rightmost bin edge by an epsilon value.
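One way to apply this workaround without picking an arbitrary epsilon is to move the last edge to the next representable float with `np.nextafter`. The helper name `digitize_like_histogram` below is hypothetical and not a NumPy function; the sketch assumes monotonically increasing, float-convertible bin edges.

```python
import numpy as np

def digitize_like_histogram(x, bins):
    """Hypothetical helper: digitize with the rightmost bin closed on the right."""
    # Work on a float copy so the caller's edges are not modified.
    bins = np.asarray(bins, dtype=float).copy()
    # Nudge the last edge up by the smallest possible step so that
    # x == original bins[-1] now satisfies bins[-2] <= x < bins[-1].
    bins[-1] = np.nextafter(bins[-1], np.inf)
    return np.digitize(x, bins)

data = [1, 2, 3, 4, 5, 6, 7]
counts, edges = np.histogram(data, 3)
idx = digitize_like_histogram(data, edges)

print(idx)  # [1 1 2 2 3 3 3]

# Counting the indices (dropping the unused index 0) reproduces histogram.
recount = np.bincount(idx, minlength=len(edges))[1:]
print(recount)  # [2 2 3]
```

Values strictly greater than the original last edge still receive the index `len(bins)`, so genuine out-of-range data stays distinguishable.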

I think it comes down to the question of what exactly people expect digitize to do. I would also be fine with always correcting one boundary manually, but I am pretty sure that this was not the original idea behind digitize.

Issue Analytics

State:
Created 10 years ago
Reactions: 9
Comments: 10 (5 by maintainers)

Top GitHub Comments

9 reactions

jrossyra commented, Mar 19, 2019

Someday this would be a good idea to fix, considering people have been complaining about it for years now apparently.

This is the kind of thing that really puts people off using a toolkit altogether. Even though it seems trivial, a lot of folks won’t find this thread or have any clue what to do from there; they will assume they are doing something wrong and go to unnecessary lengths to discretize the data themselves.

5 reactions

amueller commented, Sep 3, 2019

From the point of view of principles I would agree with @eric-wieser, but from the point of view of usefulness, I think including the point makes a lot of sense. I think it’s very rare that the current behavior of digitize is useful, and in scikit-learn we’re actually manually changing the largest bin to be a bit larger than the largest value in the data to get the expected behavior: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/calibration.py#L592

What are the use cases for calling digitize with always a single point in the last bin? (I also think it’s slightly strange that digitize returns indices starting at 1 by default; I would have assumed np.bincount(np.digitize(y, bins)) == np.histogram(y, bins)[0].)
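To see how far the two sides of that assumed equality are apart, here is a short sketch that reuses the data and bin edges from the example at the top of the issue (the variable names are illustrative):

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5, 6, 7])
bins = np.array([1.0, 3.0, 5.0, 7.0])

via_digitize = np.bincount(np.digitize(y, bins))
via_histogram = np.histogram(y, bins)[0]

# bincount has a leading slot for index 0 (values below bins[0]) and a
# trailing slot for index len(bins), where the value 7 ends up.
print(via_digitize)   # [0 2 2 2 1]
print(via_histogram)  # [2 2 3]
```

The two arrays differ in both length and content, so the comparison cannot hold as written: the index offset adds a leading zero, and the right-edge value is counted separately.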
