digitize is not the inverse of histogram (right limit treated differently)
I think `digitize` should change its behavior with respect to binning a value that matches the rightmost bin’s right edge. Such a value should go in the rightmost bin and not be classified as “beyond the bounds”. This would make the output of `histogram` compatible with `digitize` (currently it’s not).
Example (short explanation)
>>> from numpy import histogram
>>> from numpy import digitize
>>> data = [1, 2, 3, 4, 5, 6, 7]
>>> histvals, binedges = histogram(data, 3)
>>> binedges
array([ 1., 3., 5., 7.])
>>> histvals
array([2, 2, 3])
# Up to here, a histogram has been created characterized by
# four bin edges (three bins). The rightmost bin contains
# three data values, as its right edge is considered to belong
# to the bin.
# Now assign the data values to given bins
# (represented by `binedges`) via `digitize`
>>> digitize(data, binedges)
array([1, 1, 2, 2, 3, 3, 4])
# The first two data values have been assigned to bin 1,
# the next two data values were assigned to bin 2,
# the next two data values were assigned to bin 3,
# the remaining data value has been declared as "out of bounds".
# This is where `histogram` and `digitize` behave differently by
# default.
# Reproduce `histogram` binning by manually shifting the
# rightmost bin edge by an epsilon value:
>>> binedges[-1] += 10**-6
>>> digitize(data, binedges)
array([1, 1, 2, 2, 3, 3, 3])
Longish explanation
I am assuming that `digitize` is generally considered to do the inverse operation of `histogram`, i.e. whereas `histogram` creates bins (the edges of the bins) and assigns values to bins, `digitize` (quote from the `digitize` docs) “Returns the indices of the bins to which each value in input array belongs.” (given the data values and the bin edges). This “reverse indexing” behavior has also been referred to in this issue: https://github.com/numpy/numpy/issues/990. It seems like other numerical frameworks provide this functionality, too.
In the `histogram` specs (http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html), the meaning of the bin edges is clarified. By default, bins contain the left edge and do not contain the right edge, except for the rightmost bin, whose right edge belongs to the bin (“The last bin, however, is [3, 4], which includes 4.”).
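The edge convention quoted above can be checked directly. This is a minimal sketch: the edges `[1, 2, 3, 4]` mirror the example in the `histogram` docs, while the sample values are made up for illustration.

```python
import numpy as np

# Illustrative data (an assumption); the edges follow the docs' [1, 2, 3, 4] example.
values = np.array([1, 2, 3, 4])
edges = np.array([1, 2, 3, 4])

counts, _ = np.histogram(values, bins=edges)

# Bins are [1, 2), [2, 3) and [3, 4]; the value 4 is counted in the last bin.
print(counts)  # [1 1 2]
```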
In contrast, the `digitize` specs state that “Each index i returned is such that bins[i-1] <= x < bins[i]”. There is no comment on the rightmost boundary, so this general statement also applies to it.
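To make the half-open rule concrete, here is a small sketch with made-up sample values around the edges used in the short example. For increasing bins and the default `right=False`, the result should also agree with `np.searchsorted(bins, x, side="right")`, which is how I understand the NumPy docs to describe the relationship.

```python
import numpy as np

edges = np.array([1.0, 3.0, 5.0, 7.0])
# Illustrative probes: below range, on the first edge, inside the last bin,
# exactly on the last edge, and above range.
x = np.array([0.5, 1.0, 6.9, 7.0, 8.0])

idx = np.digitize(x, edges)

# 7.0 gets the same index as 8.0: the last edge is treated as out of bounds.
print(idx)  # [0 1 3 4 4]

# Same result via searchsorted (increasing bins, right=False).
same = np.array_equal(idx, np.searchsorted(edges, x, side="right"))
print(same)  # True
```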
In conclusion, whereas `histogram` (when given a number of bins rather than explicit edges) by default places the rightmost bin edge at the maximum data value and counts that value in the last bin, `digitize` does not count this data value as part of the histogram represented by the bin edges array.
Restoring compatibility between both functions requires manually incrementing the rightmost bin edge by an epsilon value.
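One way to avoid picking an arbitrary epsilon is to move the last edge to the next representable float. The sketch below assumes increasing, floating-point edges. The helper name `digitize_closed_right` is hypothetical and not part of NumPy.

```python
import numpy as np

def digitize_closed_right(x, edges):
    """Hypothetical helper: like np.digitize, but the last edge belongs to the last bin."""
    # Work on a float copy so the caller's edges are not modified.
    edges = np.asarray(edges, dtype=float).copy()
    # Replace the last edge with the next representable float above it.
    edges[-1] = np.nextafter(edges[-1], np.inf)
    return np.digitize(x, edges)

data = [1, 2, 3, 4, 5, 6, 7]
counts, binedges = np.histogram(data, 3)
idx = digitize_closed_right(data, binedges)

print(idx)  # [1 1 2 2 3 3 3]
# Counting the 1-based indices now reproduces the histogram counts.
print(np.bincount(idx)[1:])  # [2 2 3]
```

Values strictly above the last edge still receive the index `len(edges)`, so they remain distinguishable from values inside the last bin.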
I think it comes down to the question of what people expect `digitize` to do exactly. I would also be fine with always manually correcting one boundary, but am pretty sure that that has not been the original idea behind `digitize`.
Top GitHub Comments
Someday this would be a good idea to fix, considering people have been complaining about it for years now apparently.
This is the kind of thing that really puts people off of using a toolkit altogether. Even though it seems trivial, a lot of folks won’t find this thread or have any clue as to what to do from there, but would assume they are doing something wrong and go to unnecessary lengths to discretize themselves.
From a point of view of principles I would agree with @eric-wieser but from a point of view of usefulness, I think including the point makes a lot of sense. I think it’s very rare that the current behavior of `digitize` is useful, and in scikit-learn we’re actually manually changing the largest bin to be a bit larger than the largest value in the data to get the expected behavior: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/calibration.py#L592

What are use-cases of calling digitize with always a single point in the last bin? (I also think it’s slightly strange that digitize returns indices starting at 1 by default, I would have assumed `np.bincount(np.digitize(y, bins)) == np.histogram(y, bins)[0]`)
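For reference, a quick sketch of how the two sides of that comparison differ today, reusing the data and edges from the issue's short example:

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5, 6, 7])
bins = np.array([1.0, 3.0, 5.0, 7.0])

hist = np.histogram(y, bins)[0]
raw = np.bincount(np.digitize(y, bins))

print(hist)  # [2 2 3]
# Slot 0 is reserved for values below bins[0]; the final slot holds the
# value 7, which digitize treats as beyond the last edge.
print(raw)   # [0 2 2 2 1]
```

The two arrays differ in length as well as content, so the equality in the comment does not hold without adjusting the indices.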