average_precision_score() overestimates AUC value
See original GitHub issueDescription
The average_precision_score() function in sklearn doesn’t return a correct AUC value.
Steps/Code to Reproduce
Example:
import numpy as np
"""
Desc: average_precision_score returns overestimated AUC of precision-recall curve
"""
# pathological example
p = [0.833, 0.800] # precision
r = [0.294, 0.235] # recall
# computation of average_precision_score()
print("AUC = {:3f}".format(-np.sum(np.diff(r) * np.array(p)[:-1]))) # _binary_uninterpolated_average_precision()
# computation of auc() with trapezoid interpolation
print("AUC TRAP. = {:3f}".format(-np.trapz(p, r)))
# possible fix in _binary_uninterpolated_average_precision() **(edited)**
print("AUC FIX = {:3f}".format(-np.sum(np.diff(r) * np.minimum(p[:-1], p[1:])))
#>> AUC = 0.049147
#>> AUC TRAP. = 0.048174
#>> AUC FIX = 0.047200
Expected Results
AUC without interpolation = (0.294 - 0.235) * 0.800 = 0.472 AUC with trapezoidal interpolation = 0.472 + (0.294 - 0.235) * (0.833 - 0.800) / 2 = 0.0482
Actual Results
This is what sklearn implements for AUC without interpolation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html):
sum((r[i] - r[i+1]) * p[i] for i in range(len(p)-1))
>> 0.049147
This is what I think is correct (no longer; see edit):
sum((r[i] - r[i+1]) * p[i+1] for i in range(len(p)-1))
>> 0.047200
EDIT: I found that the above ‘correct’ implementation doesn’t always underestimate. It depends on the input. Therefore I have revised the uninterpolated AUC calculation to this:
sum((r[i] - r[i+1]) * min(p[i] + p[i+1]) for i in range(len(p)-1))
>> 0.047200
This has the advantage that the AUC calculation is more consistent; it is either equal or underestimated, but never overestimated (compared to the current uninterpolated AUC function). Below I show some examples on what it does:
- Example 1: all work fine
p = [0.3, 1.0]
r = [1.0, 0.0]
#Results:
>> 0.30 # sklearn's _binary_uninterpolated_average_precision()
>> 0.30 # my consistent _binary_uninterpolated_average_precision()
>> 0.65 # np.trapz() (trapezoidal interpolation)
- Example 2: sklearn’s _binary_uninterpolated_average_precision returns inaccurate number
p = [1.0, 0.3]
r = [1.0, 0.0]
#Results:
>> 1.00 # sklearn's _binary_uninterpolated_average_precision()
>> 0.30 # my consistent _binary_uninterpolated_average_precision()
>> 0.65 # np.trapz() (trapezoidal interpolation)
- Example 3: extra example
p = [0.4, 0.1, 1.0]
r = [1.0, 0.9, 0.0]
#Results:
>> 0.13 # sklearn's _binary_uninterpolated_average_precision()
>> 0.10 # my consistent _binary_uninterpolated_average_precision()
>> 0.52 # np.trapz() (trapezoidal interpolation)
Versions
Windows-10-10.0.17134-SP0 Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)] NumPy 1.14.0 SciPy 1.0.0 Scikit-Learn 0.19.1
Issue Analytics
- State:
- Created 5 years ago
- Reactions:1
- Comments:5 (4 by maintainers)
Top GitHub Comments
Thank you for your comment. I did see some of these older issues, but not all of them. I did actually find some cases where the AUC value is underestimated as well, which makes the problem a bit more complex than I initially thought.
For datasets with a small number of precision and recall thresholds, it seems better for now to use the interpolated area under the curve (i.e. sklearn.metrics.auc() or np.trapz()), or am I mistaken?
As part of scikit-learn’s triaging guidelines, I am closing this issue because it is a duplicate of https://github.com/scikit-learn/scikit-learn/issues/4577.