SelectFromModel: max_features can't be greater than number of features
Describe the bug
When I define a SelectFromModel instance like this:
SelectFromModel(RandomForestClassifier(), max_features=100)
and the total number of features is less than 100, a ValueError is raised:
ValueError: 'max_features' should be 0 and 10 features.Got 100 instead.
I consider this a bug: in my pipeline, for example, this feature selector is preceded by other feature selectors such as VarianceThreshold, so it is never known in advance how many features will be left by the time this step is reached. If the value is larger than the number of available features, it should simply keep all of them rather than raising an error.
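A minimal sketch of that scenario (the variance threshold value is illustrative, not from the original report): an upstream VarianceThreshold step can drop an unpredictable number of columns before SelectFromModel ever sees the data.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.pipeline import Pipeline

X, y = make_classification(n_features=10, n_informative=8)

# How many columns survive the variance filter depends on the data,
# so a fixed max_features can exceed whatever is left downstream.
pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.05)),
    ("select", SelectFromModel(RandomForestClassifier(), max_features=100)),
])
pipe.fit(X, y)  # raises the ValueError quoted below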
Steps/Code to Reproduce
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
# Only 10 features are generated, but max_features is set to 100.
x, y = make_classification(n_features=10, n_informative=8)
sfm = SelectFromModel(RandomForestClassifier(), max_features=100)
sfm.fit(x, y)  # raises the ValueError shown below
Expected Results
If the value is larger than the number of available features, all of them should be kept instead of raising an error.
Actual Results
ValueError: 'max_features' should be 0 and 10 features.Got 100 instead.
Versions
scikit-learn 0.23.2 and 0.24.1
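When the feature count is known at construction time (which, as the report points out, is not the case mid-pipeline), one possible workaround is to clamp the value before building the selector; a minimal sketch:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

x, y = make_classification(n_features=10, n_informative=8)
sfm = SelectFromModel(
    RandomForestClassifier(),
    max_features=min(100, x.shape[1]),  # never ask for more columns than exist
)
sfm.fit(x, y)  # fits without the ValueError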
What about accepting a callable that would take X and return an integer?
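A sketch of what that proposal could look like at the call site (the cap of 100 is illustrative; the callable receives the data X seen by this step, so it adapts to however many columns remain):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(
    RandomForestClassifier(),
    # Cap at 100, but never exceed the number of columns actually present.
    max_features=lambda X: min(100, X.shape[1]),
)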
I still think that adding support for a float could be nice, because this is a common API in other estimators and one would expect to have it here as well.
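For context, other estimators (e.g. max_features on RandomForestClassifier) read a float as a fraction of the input features. A hypothetical helper illustrating those semantics (resolve_max_features is not part of scikit-learn's API):

def resolve_max_features(max_features, n_features):
    if isinstance(max_features, float):
        # A float in (0.0, 1.0] is read as a fraction of the input columns.
        return max(1, int(max_features * n_features))
    # An integer is an absolute count, capped at what actually exists.
    return min(max_features, n_features)

print(resolve_max_features(0.5, 10))  # 5
print(resolve_max_features(100, 10))  # 10, i.e. keep everything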
Gave an initial implementation in PR #22356 if anyone is interested in taking a look!