Possible bug when combining SVC + class_weight='balanced' + LeaveOneOut
See original GitHub issue

This piece of code yields perfect classification accuracy for random data:
import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.svm import SVC

# 79 random samples with 100 features and imbalanced labels (20 vs. 59);
# C is so small that the SVM barely fits the data at all.
scores = cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08),
                         np.random.rand(79, 100),
                         y=np.hstack((np.ones(20), np.zeros(59))),
                         cv=LeaveOneOut())
print(scores)
The problem disappears when using class_weight=None or another CV scheme.
Is it a bug or am I missing something?
Tested with version 0.19.1 of scikit-learn on Ubuntu Linux.
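As a sanity check, here is a minimal sketch of the two control experiments mentioned above (variable names X and y are mine, and exact numbers will vary with the random data): the same degenerate C with class_weight=None, and the same 'balanced' weights with an ordinary stratified split. Neither should produce the perfect scores seen with LeaveOneOut:

import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut, StratifiedKFold
from sklearn.svm import SVC

X = np.random.rand(79, 100)
y = np.hstack((np.ones(20), np.zeros(59)))

# Control 1: same tiny C, but no class weighting; accuracy should fall
# back to roughly the majority-class rate (59/79) instead of 1.0.
print(cross_val_score(SVC(kernel='linear', class_weight=None, C=1e-08),
                      X, y, cv=LeaveOneOut()).mean())

# Control 2: 'balanced' weights, but a stratified 5-fold split instead of
# LeaveOneOut; the perfect scores should disappear here as well.
print(cross_val_score(SVC(kernel='linear', class_weight='balanced', C=1e-08),
                      X, y, cv=StratifiedKFold(n_splits=5)).mean())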
Issue Analytics
- Created: 6 years ago
- Comments: 23 (13 by maintainers)
Top GitHub Comments
I’m not aware of class weighting procedures other than ‘balanced’, which of course does not mean they don’t exist. In my opinion, the way this is handled with the ‘balanced’ option in sklearn is exemplary, precisely because the weights are computed on the training data only.
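A quick way to see why LeaveOneOut is special here: the 'balanced' weights recomputed on each training fold depend on which class the held-out sample came from, so the fold's weights effectively encode the test label. A minimal sketch for the 20/59 split from the original example, using compute_class_weight, the helper scikit-learn exposes for this computation:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training fold when a class-1 sample is held out: 19 ones remain.
fold_minus_one = np.hstack((np.ones(19), np.zeros(59)))
# Training fold when a class-0 sample is held out: 58 zeros remain.
fold_minus_zero = np.hstack((np.ones(20), np.zeros(58)))

for name, y_fold in [('held-out sample was class 1', fold_minus_one),
                     ('held-out sample was class 0', fold_minus_zero)]:
    w = compute_class_weight(class_weight='balanced',
                             classes=np.unique(y_fold), y=y_fold)
    # weights = n_samples / (n_classes * bincount(y)): the values differ
    # between the two fold types, leaking the identity of the left-out label.
    print(name, dict(zip(np.unique(y_fold), np.round(w, 3))))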
I would say that I came across the problem relatively organically. To elaborate, I was using GridSearchCV on the C parameter of SVC, and after setting class_weight='balanced' I suddenly got amazing accuracies on a real-world data set (i.e., not artificial/random data). I then realized that GridSearchCV was selecting very low values of C, i.e. no regularization at all, which at first was even weirder.

Based on this experience I’m inclined to recommend inclusion of your patch, because I’m sure many people will not investigate further when accuracies are good and ‘publishable’. The effect of changing class weights on the order of 1e-8 should be negligible in almost all cases, and if it is not, that is likely because of this very issue. I see the trade-off with exact backwards compatibility, though.
@jnothman I’m following up on a previous discussion. Unless I am mistaken, if class_weight='balanced' is passed to LogisticRegressionCV, the class weights are computed from the labels of the entire dataset. This breaks the independence of training and test data. Is there a specific reason for that?
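Assuming the behavior described in this comment is accurate, one way to keep the weight computation fold-local is to cross-validate a plain LogisticRegression with GridSearchCV, which clones and refits the estimator (and hence recomputes the 'balanced' weights) on each training fold. A sketch of that workaround:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X = np.random.rand(79, 100)
y = np.hstack((np.ones(20), np.zeros(59)))

search = GridSearchCV(
    LogisticRegression(class_weight='balanced'),
    param_grid={'C': np.logspace(-4, 4, 9)},
    cv=StratifiedKFold(n_splits=5))
search.fit(X, y)
# Each fold's model sees 'balanced' weights computed from that fold's
# training labels only, never from the full dataset.
print(search.best_params_)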