metrics.ndcg_score is busted

See original GitHub issue

Description

metrics.ndcg_score is busted

Steps/Code to Reproduce

from sklearn import metrics

# test 1
y_true = [0, 1, 2, 1]
y_score = [[0.15, 0.55, 0.2], [0.7, 0.2, 0.1], [0.06, 0.04, 0.9], [0.1, 0.3, 0.6]]
metrics.ndcg_score(y_true, y_score)

# test 2
y_true = [0, 1, 0, 1]
y_score = [[0.15, 0.85], [0.7, 0.3], [0.06, 0.94], [0.7, 0.3]]
metrics.ndcg_score(y_true, y_score)

Expected Results

No error is thrown.

Actual Results

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-35bb0e2c9b0e> in <module>()
----> 1 metrics.ndcg_score(y_true, y_score)

/Users/iancassidy/virtualenvs/upside/lib/python2.7/site-packages/sklearn/metrics/ranking.py in ndcg_score(y_true, y_score, k)
    849 
    850     if binarized_y_true.shape != y_score.shape:
--> 851         raise ValueError("y_true and y_score have different value ranges")

ValueError: y_true and y_score have different value ranges

Versions

Darwin-16.7.0-x86_64-i386-64bit
('Python', '2.7.10 (default, Feb  7 2017, 00:08:15) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)]')
('NumPy', '1.13.3')
('SciPy', '0.19.1')
('Scikit-Learn', '0.19.0')

Issue Analytics

State:
Created 6 years ago
Comments:21 (18 by maintainers)

Top GitHub Comments

jeromedockes commented, Oct 16, 2017

Also, I strongly doubt whether the implementation is right. It’s not consistent with the wiki page, nor with all the materials I can find. And the reference link in the code seems dead. Personally, I might think ogrisel’s implementation here is correct.

I completely agree. NDCG is meant to evaluate a ranking with respect to the true scores of the scored entities.

(see the wikipedia page, Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4), 422-446., or Wang, Y., Wang, L., Li, Y., He, D., Chen, W., & Liu, T. Y. (2013, May). A theoretical analysis of NDCG ranking measures. In Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013).)

For example, one can evaluate a ranking of answers to a query with respect to the actual relevance of the answers. Ogrisel’s code, to which @qinhanmin2014 provided a link, is a typical use case of NDCG, and the implementation is correct. So ndcg_score should accept two 2-d arrays of the same shape: y_score contains the scores inducing the predicted ranking, and y_true contains a floating-point value (e.g. relevance, term frequency, …) for each output dimension. For example, something like this should be ok:

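import numpy as np


def _cumulative_gain(relevance, ranking, k=None):
    """Discounted cumulative gain of each row of `relevance`, read in `ranking` order."""
    relevance = np.atleast_2d(relevance)
    ranking = np.atleast_2d(ranking)
    # Reorder each row of relevance values according to that row's ranking.
    ranked = relevance[np.arange(ranking.shape[0])[:, np.newaxis], ranking]
    if k is not None:
        # Keep only the top-k positions.
        ranked = ranked[:, :k]
    # Logarithmic discount: position i (0-based) is divided by log(i + 2).
    log_indices = np.log(np.arange(ranked.shape[1]) + 2)
    gain = (ranked / log_indices).sum(axis=1)
    return gain


def normalized_discounted_cumulative_gain(y_true, y_score, k=None):
    """NDCG per row; y_true and y_score are 2-d arrays of the same shape."""
    # Column indices sorted by decreasing predicted score and decreasing true score.
    prediction_ranking = np.argsort(y_score)[:, ::-1]
    true_ranking = np.argsort(y_true)[:, ::-1]
    gain = _cumulative_gain(y_true, prediction_ranking, k)
    # Gain of the ideal ordering, used as the normalizer.
    normalizing_gain = _cumulative_gain(y_true, true_ranking, k)
    # Rows whose ideal gain is 0 get a score of 0 instead of dividing by zero.
    all_irrelevant = normalizing_gain == 0
    gain[all_irrelevant] = 0
    gain[~all_irrelevant] /= normalizing_gain[~all_irrelevant]
    return gain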
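
Written out, the quantity computed by the code above can be sketched as follows. The symbols are notation introduced here for illustration, not taken from the thread: $r_j$ is the true score of item $j$ in one row of y_true, $\pi_s$ and $\pi_t$ are the orderings of the items by decreasing y_score and by decreasing y_true, and $k$ is the cutoff (the number of columns when k is None).

```latex
\mathrm{DCG}_k(\pi) = \sum_{i=1}^{k} \frac{r_{\pi(i)}}{\ln(i + 1)},
\qquad
\mathrm{NDCG}_k =
\begin{cases}
\dfrac{\mathrm{DCG}_k(\pi_s)}{\mathrm{DCG}_k(\pi_t)} & \text{if } \mathrm{DCG}_k(\pi_t) \neq 0, \\
0 & \text{otherwise.}
\end{cases}
```

The code uses the natural logarithm where many references use $\log_2$; the base only contributes a constant factor that cancels in the ratio, so the normalized score is the same either way.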

We can check whether y_true is a vector of labels instead of a matrix of true scores and perform one-hot encoding, but since this is not the most common use case, it may be better to keep the interface simple and let users one-hot encode it themselves if they want to do this.
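
A minimal sketch of the one-hot route mentioned above, assuming NumPy 1.15 or later (for np.take_along_axis). The helper name ndcg_per_sample and the use of np.eye for the encoding are choices made for this illustration; they are not part of scikit-learn or of the code in this thread. The data is "test 1" from the issue.

```python
import numpy as np


def ndcg_per_sample(y_true, y_score):
    """Per-row NDCG for two 2-d arrays of the same shape (log2 discount)."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    if y_true.shape != y_score.shape:
        raise ValueError("y_true and y_score must have the same shape")
    # Discount for 0-based position i is 1 / log2(i + 2).
    discount = 1.0 / np.log2(np.arange(y_true.shape[1]) + 2)
    # True scores reordered by decreasing predicted score.
    predicted_order = np.argsort(-y_score, axis=1)
    dcg = (np.take_along_axis(y_true, predicted_order, axis=1) * discount).sum(axis=1)
    # Ideal ordering: true scores sorted in decreasing order.
    ideal = (np.sort(y_true, axis=1)[:, ::-1] * discount).sum(axis=1)
    # Rows with no relevant item get 0 instead of dividing by zero.
    return np.divide(dcg, ideal, out=np.zeros_like(dcg), where=ideal > 0)


# "test 1" from the issue: one class label per sample, one score per class.
labels = np.array([0, 1, 2, 1])
y_score = np.array(
    [[0.15, 0.55, 0.2], [0.7, 0.2, 0.1], [0.06, 0.04, 0.9], [0.1, 0.3, 0.6]]
)

# One-hot encode: row i has a 1 in column labels[i] and 0 elsewhere.
n_classes = y_score.shape[1]
y_true = np.eye(n_classes)[labels]

scores = ndcg_per_sample(y_true, y_score)
print(scores)
```

With one-hot relevance, each row's score depends only on the rank given to the true class: 1 at rank 1, 1/log2(3) at rank 2, and 0.5 at rank 3.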

jeromedockes commented, Oct 16, 2017

So without reading too deeply into the above discussion, the problem stems from the fact that ndcg is apparently the first metric we have implemented which supports multiclass (not multilabel or binary) classification with a score, and we’ve just not implemented it right.

NDCG is not for classification; y_score and y_true should have the same shape.
