Possible bug when combining SVC + class_weight='balanced' + LeaveOneOut
See original GitHub issue

This piece of code yields perfect classification accuracy for random data:
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

# 79 random samples with 100 features and imbalanced labels (20 vs. 59);
# C is so small that the SVM barely fits the data at all.
scores = cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08),
                         np.random.rand(79, 100),
                         y=np.hstack((np.ones(20), np.zeros(59))),
                         cv=LeaveOneOut())
print(scores)
The problem disappears when using class_weight=None or another CV scheme.
Is it a bug or am I missing something?
Tested with version 0.19.1 of scikit-learn on Ubuntu Linux.
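As a sanity check, here is a minimal sketch of the two control experiments mentioned above (variable names X and y are mine, and exact numbers will vary with the random data): the same degenerate C with class_weight=None, and the same 'balanced' weights with an ordinary stratified split. Neither should produce the perfect scores seen with LeaveOneOut:

import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut, StratifiedKFold
from sklearn.svm import SVC

X = np.random.rand(79, 100)
y = np.hstack((np.ones(20), np.zeros(59)))

# Control 1: same tiny C, but no class weighting; accuracy should fall
# back to roughly the majority-class rate (59/79) instead of 1.0.
print(cross_val_score(SVC(kernel='linear', class_weight=None, C=1e-08),
                      X, y, cv=LeaveOneOut()).mean())

# Control 2: 'balanced' weights, but a stratified 5-fold split instead of
# LeaveOneOut; the perfect scores should disappear here as well.
print(cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08),
                      X, y, cv=StratifiedKFold(n_splits=5)).mean())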
Issue Analytics
- Created: 6 years ago
- Comments: 23 (13 by maintainers)
Top GitHub Comments
I’m not aware of class weighting procedures other than ‘balanced’, which of course does not mean they don’t exist. In my opinion, the way this is handled with the ‘balanced’ option in sklearn is exemplary, precisely because the weights are computed on the training data only.
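A quick way to see why LeaveOneOut is special here: the 'balanced' weights recomputed on each training fold depend on which class the held-out sample came from, so the fold's weights effectively encode the test label. A minimal sketch for the 20/59 split from the original example, using compute_class_weight, the helper scikit-learn exposes for this computation:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training fold when a class-1 sample is held out: 19 ones remain.
fold_minus_one = np.hstack((np.ones(19), np.zeros(59)))
# Training fold when a class-0 sample is held out: 58 zeros remain.
fold_minus_zero = np.hstack((np.ones(20), np.zeros(58)))

for name, y_fold in [('held-out sample was class 1', fold_minus_one),
                     ('held-out sample was class 0', fold_minus_zero)]:
    w = compute_class_weight(class_weight='balanced',
                             classes=np.unique(y_fold), y=y_fold)
    # weights = n_samples / (n_classes * bincount(y)): the values differ
    # between the two fold types, leaking the identity of the left-out label.
    print(name, dict(zip(np.unique(y_fold), np.round(w, 3))))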
I would say that I came across the problem relatively organically. To elaborate, I was using GridSearchCV on the C parameter of SVC, and after setting class_weight='balanced' I suddenly got amazing accuracies on a real-world data set (i.e., not artificial/random data). I then realized that GridSearchCV was selecting very low values of C, i.e. no regularization at all, which at first was even weirder.

Based on this experience I’m inclined to recommend inclusion of your patch, because I’m sure many people will not investigate further when accuracies are good and ‘publishable’. The effect of changing class weights on the order of 1e-8 should be negligible in almost all cases, and if it is not, that is likely because of this very issue. I see the trade-off with exact backwards compatibility, though.
@jnothman I’m following up on a previous discussion. Unless I am mistaken, if class_weight='balanced' is passed to LogisticRegressionCV, the class weights are computed from the labels of the entire dataset. This breaks the independence of training and test data. Is there a specific reason for that?
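Assuming the behavior described in this comment is accurate, one way to keep the weight computation fold-local is to cross-validate a plain LogisticRegression with GridSearchCV, which clones and refits the estimator (and hence recomputes the 'balanced' weights) on each training fold. A sketch of that workaround:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X = np.random.rand(79, 100)
y = np.hstack((np.ones(20), np.zeros(59)))

search = GridSearchCV(
    LogisticRegression(class_weight='balanced'),
    param_grid={'C': np.logspace(-4, 4, 9)},
    cv=StratifiedKFold(n_splits=5))
search.fit(X, y)
# Each fold's model sees 'balanced' weights computed from that fold's
# training labels only, never from the full dataset.
print(search.best_params_)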