flip_y in make_classification is misleading
See original GitHub issue

As per the description of flip_y, it is "The fraction of samples whose class are randomly exchanged."

So with two classes, one would expect that setting flip_y to 0.1 flips (exchanges) 10% of the labels, as the name suggests (flip_y). However, if you look at the source code, 10% of the samples are assigned a random label, and about 50% of the time that random label happens to equal the original one, so only about 5% of the labels end up flipped.
This doesn’t seem like a big issue at first, but we have had so many people confused with flip_y in a competition on Kaggle at https://www.kaggle.com/c/instant-gratification.
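The effect can be measured empirically. This is a minimal sketch (assuming scikit-learn is installed; the dataset parameters are arbitrary illustrative choices): with `shuffle=False` and the same `random_state`, the two calls draw identical labels up to the flipping step, so comparing them isolates what flip_y actually does.

```python
import numpy as np
from sklearn.datasets import make_classification

# Identical generation settings; only flip_y differs between the two calls.
common = dict(n_samples=100_000, n_features=5, n_informative=3,
              n_classes=2, shuffle=False, random_state=0)

_, y_clean = make_classification(flip_y=0.0, **common)
_, y_noisy = make_classification(flip_y=0.1, **common)

# Fraction of labels that actually changed between the two datasets.
changed = np.mean(y_clean != y_noisy)
print(f"fraction of labels actually flipped: {changed:.3f}")
# With two balanced classes this lands near 0.05, not 0.10, because
# roughly half of the randomly reassigned labels keep their old value.
```

Internally, the flipping step is essentially `y[mask] = rng.randint(n_classes, size=mask.sum())` for a mask covering a flip_y fraction of samples, which is why the observed flip rate is about flip_y * (n_classes - 1) / n_classes.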
Issue Analytics
- Created 4 years ago
- Comments: 5 (5 by maintainers)

The current description is: “The fraction of samples whose class are randomly exchanged. Larger values introduce noise in the labels and make the classification task harder.”
We are suggesting: “The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.”
Any other suggestions?
I think we would prefer not to change the variable name, and just improve the documentation.