
average_precision_score() overestimates AUC value


Description

The average_precision_score() function in sklearn doesn’t return a correct AUC value.
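For context, this is the call being reported (a minimal sketch; the labels and scores below are made up purely to illustrate the API and are not taken from this issue):

import numpy as np
from sklearn.metrics import average_precision_score

# hypothetical binary labels and classifier scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.3, 0.15, 0.6])

# average_precision_score() summarizes the precision-recall curve with the
# step-wise sum questioned in this issue: sum((r[i] - r[i+1]) * p[i])
print(average_precision_score(y_true, y_score))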

Steps/Code to Reproduce

Example:

import numpy as np
"""
    Desc: average_precision_score returns overestimated AUC of precision-recall curve
"""
# pathological example
p = [0.833, 0.800] # precision
r = [0.294, 0.235] # recall

# computation of average_precision_score()
print("AUC       = {:3f}".format(-np.sum(np.diff(r) * np.array(p)[:-1]))) # _binary_uninterpolated_average_precision()

# computation of auc() with trapezoid interpolation
print("AUC TRAP. = {:3f}".format(-np.trapz(p, r)))

# possible fix in _binary_uninterpolated_average_precision() (edited)
print("AUC FIX   = {:3f}".format(-np.sum(np.diff(r) * np.minimum(p[:-1], p[1:]))))

#>> AUC       = 0.049147
#>> AUC TRAP. = 0.048174
#>> AUC FIX   = 0.047200

Expected Results

AUC without interpolation = (0.294 - 0.235) * 0.800 = 0.0472
AUC with trapezoidal interpolation = 0.0472 + (0.294 - 0.235) * (0.833 - 0.800) / 2 ≈ 0.0482
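The same arithmetic as a short script (a sketch for checking the numbers above; the variable names are mine):

p = [0.833, 0.800]  # precision
r = [0.294, 0.235]  # recall

step_area = (r[0] - r[1]) * p[1]                           # rectangle using the lower precision
trap_area = step_area + (r[0] - r[1]) * (p[0] - p[1]) / 2  # plus the triangle on top

print(step_area)  # ~0.0472
print(trap_area)  # ~0.0482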

Actual Results

This is what sklearn implements for AUC without interpolation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html):

sum((r[i] - r[i+1]) * p[i] for i in range(len(p)-1))
>> 0.049147

This is what I initially thought was correct (no longer; see the edit below):

sum((r[i] - r[i+1]) * p[i+1] for i in range(len(p)-1))
>> 0.047200

EDIT: I found that the above ‘correct’ implementation doesn’t always underestimate. It depends on the input. Therefore I have revised the uninterpolated AUC calculation to this:

sum((r[i] - r[i+1]) * min(p[i], p[i+1]) for i in range(len(p)-1))
>> 0.047200

This has the advantage that the AUC calculation is more consistent: compared to the current uninterpolated AUC function it is either equal or lower, but never higher. Below I show some examples of what it does (a combined script that reproduces all three is sketched after them):

  • Example 1: both uninterpolated implementations agree here
p = [0.3, 1.0]
r = [1.0, 0.0]

#Results:
>> 0.30    # sklearn's _binary_uninterpolated_average_precision()
>> 0.30    # my consistent _binary_uninterpolated_average_precision()
>> 0.65    # np.trapz() (trapezoidal interpolation)

[Figure pr_curve1: precision-recall curve for Example 1]

  • Example 2: sklearn’s _binary_uninterpolated_average_precision returns an inaccurate number
p = [1.0, 0.3]
r = [1.0, 0.0]

#Results:
>> 1.00    # sklearn's _binary_uninterpolated_average_precision()
>> 0.30    # my consistent _binary_uninterpolated_average_precision()
>> 0.65    # np.trapz() (trapezoidal interpolation)

[Figure pr_curve2: precision-recall curve for Example 2]

  • Example 3: extra example
p = [0.4, 0.1, 1.0]
r = [1.0, 0.9, 0.0]

#Results:
>> 0.13      # sklearn's _binary_uninterpolated_average_precision()
>> 0.10      # my consistent _binary_uninterpolated_average_precision()
>> 0.52      # np.trapz() (trapezoidal interpolation)

[Figure pr_curve3: precision-recall curve for Example 3]
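A small self-contained script that reproduces the three examples above (the helper names are mine, not sklearn internals; the minus signs account for recall being listed in decreasing order):

import numpy as np

def ap_step(p, r):
    # sklearn-style uninterpolated sum: sum((r[i] - r[i+1]) * p[i])
    p, r = np.asarray(p), np.asarray(r)
    return float(-np.sum(np.diff(r) * p[:-1]))

def ap_step_min(p, r):
    # proposed variant: use the smaller of the two adjacent precisions
    p, r = np.asarray(p), np.asarray(r)
    return float(-np.sum(np.diff(r) * np.minimum(p[:-1], p[1:])))

def ap_trapezoid(p, r):
    # trapezoidal interpolation over the (recall, precision) pairs
    p, r = np.asarray(p), np.asarray(r)
    return float(-np.trapz(p, r))

examples = [
    ([0.3, 1.0], [1.0, 0.0]),           # Example 1 -> 0.30, 0.30, 0.65
    ([1.0, 0.3], [1.0, 0.0]),           # Example 2 -> 1.00, 0.30, 0.65
    ([0.4, 0.1, 1.0], [1.0, 0.9, 0.0]), # Example 3 -> 0.13, 0.10, 0.52
]
for p, r in examples:
    print(ap_step(p, r), ap_step_min(p, r), ap_trapezoid(p, r))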

Versions

Windows-10-10.0.17134-SP0
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
barrydebruin commented, Jan 31, 2019

Thank you for your comment. I did see some of these older issues, but not all of them. I did actually find some cases where the AUC value is underestimated as well, which makes the problem a bit more complex than I initially thought.

For datasets with a small number of precision and recall thresholds, it seems better for now to use the interpolated area under the curve (i.e. sklearn.metrics.auc() or np.trapz()), or am I mistaken?
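For reference, a minimal sketch of the two alternatives mentioned here, using made-up labels and scores (purely illustrative, not data from this issue):

import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_score = np.array([0.2, 0.9, 0.6, 0.4, 0.8, 0.1, 0.55, 0.3])

precision, recall, _ = precision_recall_curve(y_true, y_score)

pr_auc_trapz = auc(recall, precision)          # trapezoidal rule, same idea as np.trapz()
ap = average_precision_score(y_true, y_score)  # step-wise (uninterpolated) estimate

print(pr_auc_trapz, ap)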

0 reactions
thomasjpfan commented, Apr 22, 2022

As part of scikit-learn’s triaging guidelines, I am closing this issue because it is a duplicate of https://github.com/scikit-learn/scikit-learn/issues/4577.
