average_precision_score() overestimates AUC value

Description

The average_precision_score() function in sklearn can overestimate the area under the precision-recall curve (AUC).

Steps/Code to Reproduce

Example:

import numpy as np

# Desc: average_precision_score returns an overestimated AUC of the precision-recall curve

# pathological example
p = [0.833, 0.800]  # precision
r = [0.294, 0.235]  # recall

# computation of average_precision_score(), as in _binary_uninterpolated_average_precision()
print("AUC       = {:.6f}".format(-np.sum(np.diff(r) * np.array(p)[:-1])))

# computation of auc() with trapezoidal interpolation (np.trapz is named np.trapezoid in NumPy >= 2.0)
print("AUC TRAP. = {:.6f}".format(-np.trapz(p, r)))

# possible fix in _binary_uninterpolated_average_precision() (edited)
print("AUC FIX   = {:.6f}".format(-np.sum(np.diff(r) * np.minimum(p[:-1], p[1:]))))

#>> AUC       = 0.049147
#>> AUC TRAP. = 0.048174
#>> AUC FIX   = 0.047200

Expected Results

AUC without interpolation = (0.294 - 0.235) * 0.800 = 0.0472

AUC with trapezoidal interpolation = 0.0472 + (0.294 - 0.235) * (0.833 - 0.800) / 2 = 0.0482

Actual Results

This is what sklearn implements for AUC without interpolation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html):

sum((r[i] - r[i+1]) * p[i] for i in range(len(p)-1))
>> 0.049147
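
One way to check the quoted sum against the public API is to feed it the arrays returned by precision_recall_curve() and compare the result with average_precision_score(). The following is a minimal sketch, assuming scikit-learn is installed; the labels and scores are toy values made up for illustration and do not come from the issue.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and scores, made up for illustration (not from the issue).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# precision_recall_curve returns recall in decreasing order, ending at 0.
p, r, _ = precision_recall_curve(y_true, y_score)

# The sum quoted above, applied to those arrays.
manual = sum((r[i] - r[i + 1]) * p[i] for i in range(len(p) - 1))
ap = average_precision_score(y_true, y_score)

print(f"manual sum              = {manual:.6f}")
print(f"average_precision_score = {ap:.6f}")
```

If the two printed numbers agree, the sum is a faithful restatement of what the function computes for this input.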

This is what I think is correct (no longer; see edit):

sum((r[i] - r[i+1]) * p[i+1] for i in range(len(p)-1))
>> 0.047200

EDIT: I found that the above ‘correct’ implementation doesn’t always underestimate. It depends on the input. Therefore I have revised the uninterpolated AUC calculation to this:

sum((r[i] - r[i+1]) * min(p[i], p[i+1]) for i in range(len(p)-1))
>> 0.047200
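
For reference, with recall listed in decreasing order as in the snippets above, the three per-segment sums discussed in this issue can be written side by side. The subscripted names are labels introduced here for illustration; they are not sklearn terminology.

```latex
\begin{aligned}
A_{\text{step}} &= \sum_{i} (r_i - r_{i+1})\, p_i \\
A_{\text{min}}  &= \sum_{i} (r_i - r_{i+1})\, \min(p_i,\, p_{i+1}) \\
A_{\text{trap}} &= \sum_{i} (r_i - r_{i+1})\, \frac{p_i + p_{i+1}}{2}
\end{aligned}
```

Because each width r_i - r_{i+1} is non-negative and min(p_i, p_{i+1}) is no larger than either p_i or the mean of p_i and p_{i+1}, A_min can never exceed A_step or A_trap.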

This has the advantage that the AUC calculation is more consistent; it is either equal or underestimated, but never overestimated (compared to the current uninterpolated AUC function). Below I show some examples of what it does:

Example 1: all methods work fine

p = [0.3, 1.0]
r = [1.0, 0.0]

#Results:
>> 0.30    # sklearn's _binary_uninterpolated_average_precision()
>> 0.30    # my consistent _binary_uninterpolated_average_precision()
>> 0.65    # np.trapz() (trapezoidal interpolation)

[Figure: pr_curve1]

Example 2: sklearn’s _binary_uninterpolated_average_precision() returns an inaccurate number

p = [1.0, 0.3]
r = [1.0, 0.0]

#Results:
>> 1.00    # sklearn's _binary_uninterpolated_average_precision()
>> 0.30    # my consistent _binary_uninterpolated_average_precision()
>> 0.65    # np.trapz() (trapezoidal interpolation)

[Figure: pr_curve2]

Example 3: extra example

p = [0.4, 0.1, 1.0]
r = [1.0, 0.9, 0.0]

#Results:
>> 0.13      # sklearn's _binary_uninterpolated_average_precision()
>> 0.10      # my consistent _binary_uninterpolated_average_precision()
>> 0.52      # np.trapz() (trapezoidal interpolation)

[Figure: pr_curve3]
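
The numbers in Examples 1 to 3 can be reproduced without any dependencies. The following is a minimal pure-Python sketch; the function names step_left, step_min and trapezoid are hypothetical labels chosen here, and step_left follows the sum quoted in this issue rather than sklearn's source code.

```python
def step_left(p, r):
    """Step estimate using p[i] on each segment (the sum quoted in this issue)."""
    return sum((r[i] - r[i + 1]) * p[i] for i in range(len(p) - 1))


def step_min(p, r):
    """Step estimate using the smaller of the two neighbouring precisions."""
    return sum((r[i] - r[i + 1]) * min(p[i], p[i + 1]) for i in range(len(p) - 1))


def trapezoid(p, r):
    """Trapezoidal estimate using the mean of the two neighbouring precisions."""
    return sum((r[i] - r[i + 1]) * (p[i] + p[i + 1]) / 2 for i in range(len(p) - 1))


# (precision, recall) pairs copied from Examples 1 to 3; recall is decreasing.
examples = {
    "Example 1": ([0.3, 1.0], [1.0, 0.0]),
    "Example 2": ([1.0, 0.3], [1.0, 0.0]),
    "Example 3": ([0.4, 0.1, 1.0], [1.0, 0.9, 0.0]),
}

for name, (p, r) in examples.items():
    print(f"{name}: left={step_left(p, r):.2f} "
          f"min={step_min(p, r):.2f} trap={trapezoid(p, r):.2f}")
```

In all three examples the min-based value is no larger than the other two, which matches the behaviour described in the edit.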

Versions

Windows-10-10.0.17134-SP0
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1

Top GitHub Comments

barrydebruin commented, Jan 31, 2019

Thank you for your comment. I did see some of these older issues, but not all of them. I did actually find some cases where the AUC value is underestimated as well, which makes the problem a bit more complex than I initially thought.

For datasets with a small number of precision and recall thresholds, it seems better for now to use the interpolated area under the curve (i.e. sklearn.metrics.auc() or np.trapz()), or am I mistaken?

thomasjpfan commented, Apr 22, 2022

As part of scikit-learn’s triaging guidelines, I am closing this issue because it is a duplicate of https://github.com/scikit-learn/scikit-learn/issues/4577.
