Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

metrics.confusion_matrix far too slow for Boolean cases

See original GitHub issue

Description

When using metrics.confusion_matrix with np.bool_ cases (e.g. with only True/False values), it is far more fast to not use list comprehensions as the current code does. numpy has sum and Boolean logic functions to deal with this very efficiently to scale far better. Code found in: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/_classification.py which uses list comprehensions and never checks for np.bool_ types trying to avoid all the excessive code. This is a very common and reasonable use case in practice. Assuming normalization and no sample weights though those also can be efficiently dealt with.

Steps/Code to Reproduce

import numpy as np

N = 4096
p = 0.5
a = np.random.choice(a=[False, True], size=(N), p=[p, 1-p])
b = np.random.choice(a=[False, True], size=(N), p=[p, 1-p])
from sklearn.metrics import confusion_matrix
for i in range(1024): confusion_matrix(a, b)

Expected Results

Fast execution time. e.g. substituting confusion_matrix with this conf_mat (not efficient as possible but easier to read and more efficient than current library even with 4 sums, 4 logical ANDs and 4 logical NOTs:

        def conf_mat(x, y):
            return np.array([[np.sum(~x & ~y), np.sum(~x & y)], #true negatives, false positives
                             [np.sum(x & ~y), np.sum(x & y)]]) #false negatives, true positives

Or even faster with 1 logical AND with 3 sums:

        def conf_mat(x, y):
            truepos, totalpos, totaltrue = np.sum(x & y), np.sum(y), np.sum(x)
            totalfalse = len(x) - totaltrue
            return np.array([[totalfalse - falseneg, totalpos - truepos], #true negatives, false positives
                             [totaltrue - truepos, truepos]]) #false negatives, true positives

Actual Results

Slow execution time around 60 times slower than efficient code. The np.bool_ case could be identified and efficient code applied otherwise in serious cases of scale, the current code is too slow to be practically usable.

Versions

All - including 0.21

Issue Analytics

State:
Created 4 years ago
Comments:18 (17 by maintainers)

Top GitHub Comments

3reactions

jnothmancommented, Oct 30, 2019

Also consider, if x and y are Boolean:

confusion = np.bincount(y_true * 2 + y_pred, minlength=4).reshape(2, 2)

1reaction

dyollbcommented, Sep 24, 2021

I am evaluation 3D ML-based segmentation predictions and was looking for a fast confusion matrix implementation. My observation was that sklearn.metrics.confusion_matrix is quite slow. In fact it is slower than loading the data and doing inference with a UNet.

I did a comparison of different ways to compute the confusion matrix:

import numpy as np
from numba import njit, generated_jit, types
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
import pandas
from timeit import default_timer as timer

def compute_confusion_naive(a: np.ndarray, b: np.ndarray, num_classes: int = 0):
    if num_classes < 1:
        num_classes = max(np.max(a), np.max(b)) + 1
    cm = np.zeros((num_classes, num_classes))
    for i in range(a.shape[0]):
        cm[a[i], b[i]] += 1
    return cm


def compute_confusion_zip(a: np.ndarray, b: np.ndarray, num_classes: int = 0):
    if num_classes < 1:
        num_classes = max(np.max(a), np.max(b)) + 1
    cm = np.zeros((num_classes, num_classes))
    for ai, bi in zip(a, b):
        cm[ai, bi] += 1
    return cm


@njit
def compute_confusion_numba(a: np.ndarray, b: np.ndarray, num_classes: int = 0):
    if num_classes < 1:
        num_classes = max(np.max(a), np.max(b)) + 1
    cm = np.zeros((num_classes, num_classes))
    for i in range(a.shape[0]):
        cm[a[i], b[i]] += 1
    return cm


def compute_confusion_sklearn(a: np.ndarray, b: np.ndarray):
    return sk_confusion_matrix(a, b)


def compute_confusion_pandas(a: np.ndarray, b: np.ndarray):
    return pandas.crosstab(pandas.Series(a), pandas.Series(b))


if __name__ == "__main__":
    A = np.random.randint(15, size=310*310*310)
    B = np.random.randint(15, size=310*310*310)

    start = timer()
    cm1 = compute_confusion_naive(A, B)
    end = timer()
    print("Naive: %g s" % (end-start))

    start = timer()
    cm1 = compute_confusion_zip(A, B)
    end = timer()
    print("Naive-Zip: %g s" % (end-start))

    start = timer()
    cm1 = compute_confusion_sklearn(A, B)
    end = timer()
    print("sklearn: %g s" % (end-start))

    start = timer()
    cm1 = compute_confusion_numba(A, B, 0)
    end = timer()
    print("Numba: %g s" % (end-start))

    start = timer()
    cm1 = compute_confusion_pandas(A, B)
    end = timer()
    print("pandas: %g s" % (end-start))

The results are:

Naive: 18.6546 s Naive-Zip: 17.86 s sklearn: 18.5911 s Numba: 0.674944 s pandas: 5.81173 s

The timing for the numba implementation can be optimized further (by half) if num_classes is known, using dispatch via generated_jit to skip computing the max of a and b.

Top Results From Across the Web

Understanding Confusion Matrix | by Sarang Narkhede

Confusion Matrix is a performance measurement for machine learning classification. This blog aims to answer the following questions: What the confusion matrix ......

Demystify the Confusion!!! - Medium

This concept played a very crucial role in analyzing the models and this article is my understanding of confusion matrix.

Balanced Accuracy: When Should You Use It? - neptune.ai

Metrics are used to judge and measure model performance after training. One important tool that shows the performance of our model is the...

python - How to write a confusion matrix - Stack Overflow

Scikit-Learn provides a confusion_matrix function from sklearn.metrics import confusion_matrix y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2]...

Efficiently Factorizing Boolean Matrices using Proximal ...

Abstract: Addressing the interpretability problem of NMF on Boolean data, Boolean Matrix Factorization (BMF) uses Boolean algebra to ...