
Possible bug when combining SVC + class_weights='balanced' + LeaveOneOut

See original GitHub issue

This piece of code yields perfect classification accuracy for random data:

import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

scores = cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08), 
                         np.random.rand(79, 100), 
                         y=np.hstack((np.ones(20), np.zeros(59))), 
                         cv=LeaveOneOut())
print(scores)

The problem disappears when using class_weight=None or another cross-validation scheme.

Is it a bug or am I missing something?

Tested with version 0.19.1 of scikit-learn on Ubuntu Linux.
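One way to see why 'balanced' interacts badly with LeaveOneOut: the class weights are recomputed from the training labels of each fold, and with LOO the training fold is missing exactly one sample, so the resulting weights differ depending on the class of the held-out sample — i.e., the weights themselves encode the held-out label. A minimal sketch of this leakage (the helper loo_weights is mine, not from the issue; it uses sklearn's compute_class_weight, the same routine 'balanced' relies on):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Same labels as the snippet above: 20 samples of class 1, 59 of class 0.
y = np.hstack((np.ones(20), np.zeros(59)))

def loo_weights(y, left_out_idx):
    """Balanced class weights on the training fold of one LeaveOneOut split."""
    y_train = np.delete(y, left_out_idx)
    return compute_class_weight('balanced', classes=np.array([0., 1.]), y=y_train)

w_out_one = loo_weights(y, 0)    # a class-1 sample is held out
w_out_zero = loo_weights(y, 78)  # a class-0 sample is held out
print(w_out_one, w_out_zero)
```

The two weight vectors differ, and class 1 is weighted more heavily exactly when a class-1 sample was left out — information about the test label has leaked into the fitted model.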

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Comments: 23 (13 by maintainers)

Top GitHub Comments

1 reaction
m-guggenmos commented, Dec 4, 2017

I’m not aware of class weighting procedures other than ‘balanced’, which of course does not mean they don’t exist. In my opinion, the way this is handled with the ‘balanced’ option in sklearn is exemplary, precisely because the weights are computed on the training data only.

I would say that I came across the problem relatively organically. To elaborate, I was using GridSearchCV on the C parameter of SVC and after setting class_weight='balanced' I suddenly got amazing accuracies on a real-world data set (i.e., not artificial/random data). I then realized that GridSearchCV was selecting very low values of C, i.e. no regularization at all, which at first was even weirder.

Based on this experience I’m inclined to recommend inclusion of your patch, because I’m sure many people will not investigate further when accuracies are good and ‘publishable’. The effect of changing class weights in the order of 1e-8 should be negligible in almost all cases, and if not, it’s likely because of this very issue. I see the trade-off with exact backwards compatibility though.
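The "negligible in almost all cases" claim can be sanity-checked directly: away from the degenerate tiny-C regime, perturbing one class weight by 1e-8 barely moves the decision function. A rough sketch on random data (my own illustration, with C=1 and a loose tolerance; the thresholds are assumptions, not values from the thread):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(79, 100)
y = np.hstack((np.ones(20), np.zeros(59)))

# Two SVMs whose class weights differ only by 1e-8 on class 1.
clf_a = SVC(kernel='linear', C=1.0, class_weight={0.0: 1.0, 1.0: 1.0}).fit(X, y)
clf_b = SVC(kernel='linear', C=1.0, class_weight={0.0: 1.0, 1.0: 1.0 + 1e-8}).fit(X, y)

# With C=1 the perturbation is far below the solver's working precision,
# so the two decision functions should be essentially indistinguishable.
diff = np.max(np.abs(clf_a.decision_function(X) - clf_b.decision_function(X)))
print(diff)
```

With C=1e-08, by contrast, the tie between the (exactly equal) total class weights is what makes the intercept degenerate, which is why such a tiny perturbation can matter only in that pathological setting.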

0 reactions
jbschiratti commented, Oct 27, 2022

@jnothman I’m following up on a previous discussion. Unless I am mistaken, if class_weight='balanced' is passed to LogisticRegressionCV, the class weights are computed from the labels of the entire dataset. This breaks the independence of training and test data. Is there a specific reason for this?
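If the goal is simply to guarantee that the 'balanced' weights are derived from training folds only, one possible workaround (my suggestion, not something proposed in the thread) is to skip LogisticRegressionCV and instead wrap a plain LogisticRegression in GridSearchCV: the search clones and refits the estimator on each training fold, so the weights are recomputed per fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.RandomState(0)
X = rng.rand(79, 100)
y = np.hstack((np.ones(20), np.zeros(59)))

# Each fold refits a fresh clone, so 'balanced' weights come from that
# fold's training labels only, never from the full label vector.
search = GridSearchCV(
    LogisticRegression(class_weight='balanced', max_iter=1000),
    param_grid={'C': [0.01, 1.0, 100.0]},
    cv=StratifiedKFold(n_splits=5),
)
search.fit(X, y)
print(search.best_params_)
```

The trade-off is losing LogisticRegressionCV's warm-started regularization path, so this can be slower for fine C grids.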

Read more comments on GitHub >
