flip_y in make_classification is misleading
See original GitHub issue

As per the description of flip_y, it is "The fraction of samples whose class are randomly exchanged."

So with two classes, one would expect that setting flip_y to 0.1 flips (exchanges) 10% of the labels, as the name suggests (flip_y). However, if you look at the source code, 10% of the samples are assigned a random label, and about 50% of the time that random label happens to equal the original one, so only about 5% of the labels end up flipped.
This doesn’t seem like a big issue at first, but we have had so many people confused with flip_y in a competition on Kaggle at https://www.kaggle.com/c/instant-gratification.
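The effect can be measured empirically. This is a minimal sketch (assuming scikit-learn is installed; the dataset parameters are arbitrary illustrative choices): with `shuffle=False` and the same `random_state`, the two calls draw identical labels up to the flipping step, so comparing them isolates what flip_y actually does.

```python
import numpy as np
from sklearn.datasets import make_classification

# Identical generation settings; only flip_y differs between the two calls.
common = dict(n_samples=100_000, n_features=5, n_informative=3,
              n_classes=2, shuffle=False, random_state=0)

_, y_clean = make_classification(flip_y=0.0, **common)
_, y_noisy = make_classification(flip_y=0.1, **common)

# Fraction of labels that actually changed between the two datasets.
changed = np.mean(y_clean != y_noisy)
print(f"fraction of labels actually flipped: {changed:.3f}")
# With two balanced classes this lands near 0.05, not 0.10, because
# roughly half of the randomly reassigned labels keep their old value.
```

Internally, the flipping step is essentially `y[mask] = rng.randint(n_classes, size=mask.sum())` for a mask covering a flip_y fraction of samples, which is why the observed flip rate is about flip_y * (n_classes - 1) / n_classes.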
Issue Analytics
- Created 4 years ago
- Comments: 5 (5 by maintainers)

The current description is: “The fraction of samples whose class are randomly exchanged. Larger values introduce noise in the labels and make the classification task harder.”
We are suggesting: “The fraction of samples whose class is assigned randomly. Larger values introduce noise in the labels and make the classification task harder.”
Any other suggestions?
I think we would prefer not to change the variable name, and just improve the documentation.