SelectFromModel in scikit-learn 1.0.1 does not work properly with CatBoost and caching
Describe the bug
Problem: SelectFromModel in scikit-learn 1.0.1 does not work properly with CatBoost: it seems to drop the column names, so get_feature_names_out does not return the proper column names. This leads to additional problems when combining CatBoost and scikit-learn in a pipeline with caching during hyperparameter optimization. With other models it works as expected.
Here is a reproducible example, followed by the error message from caching:
Steps/Code to Reproduce
from sklearn.svm import LinearSVR
from sklearn.datasets import load_iris
from catboost import CatBoostRegressor
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel
##################################################
#get the data and assign column names
##################################################
iris = load_iris()
X = pd.DataFrame(data=np.c_[iris['data']])
X.columns = ["a","b","c","d"]
y = iris.target
##########################################################
##test with catboost
##########################################################
#train the selector
selector = SelectFromModel(estimator=CatBoostRegressor())
selector.fit(X, y)
#show the selected columns
selector.get_feature_names_out()
#array(['x2', 'x3'], dtype=object) instead of the original column names
#the original column names have been replaced with generated names x2 and x3
#if the selector is applied
selector.transform(X.iloc[1:2,:])
#I get the following warning
#X has feature names, but SelectFromModel was fitted without feature names
##############################################################
##Now the same with SVR, here it works
##############################################################
#train the selector
selector = SelectFromModel(estimator=LinearSVR())
selector.fit(X, y)
#the proper names are being returned
selector.get_feature_names_out()
#array(['d'], dtype=object) and "d" is a proper column name
#apply the selector
selector.transform(X.iloc[1:2,:])
#no warning is given
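The difference between the two cases appears to come down to whether the fitted selector knows the input feature names (see the maintainer comment below about feature_names_in_). A quick diagnostic sketch, assuming feature_names_in_ is what get_feature_names_out relies on; verbose=0 is only there to silence CatBoost's training output:
#diagnostic sketch (assumption): check whether each fitted selector exposes feature_names_in_
for est in (CatBoostRegressor(verbose=0), LinearSVR()):
    sel = SelectFromModel(estimator=est).fit(X, y)
    #getattr returns the fallback string if the attribute is missing
    print(type(est).__name__, getattr(sel, "feature_names_in_", "no feature_names_in_"))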
Error message when used in hyperparameter tuning with caching
File "C:\LocalTools\Miniconda\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 647, in transform
Xt = transform.transform(Xt)
File "C:\LocalTools\Miniconda\envs\my-rdkit-env\lib\site-packages\sklearn\feature_selection\_base.py", line 83, in transform
X = self._validate_data(
File "C:\LocalTools\Miniconda\envs\my-rdkit-env\lib\site-packages\sklearn\base.py", line 580, in _validate_data
self._check_n_features(X, reset=reset)
File "C:\LocalTools\Miniconda\envs\my-rdkit-env\lib\site-packages\sklearn\base.py", line 395, in _check_n_features
raise ValueError(
ValueError: X has 290 features, but SelectFromModel is expecting 0 features as input.
Actual Results
Because no proper column names are returned, the caching mechanism does not seem to be able to match the feature names properly.
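For context, here is a minimal sketch of the kind of cached pipeline the traceback refers to. The step names, parameter grid, and use of GridSearchCV are assumptions for illustration; it may not reproduce the exact 290-feature error, which comes from the reporter's own dataset:
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
#cache fitted transformers between hyperparameter candidates
cache_dir = mkdtemp()
pipe = Pipeline(
    steps=[
        ("select", SelectFromModel(estimator=CatBoostRegressor(verbose=0))),
        ("model", CatBoostRegressor(verbose=0)),
    ],
    memory=cache_dir,
)
search = GridSearchCV(pipe, param_grid={"model__depth": [4, 6]}, cv=3)
search.fit(X, y)  #X, y as defined in the reproducer above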
Versions
System:
    python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 06:25:23) [MSC v.1916 64 bit (AMD64)]
executable: C:\LocalTools\Miniconda\envs\my-rdkit-env\python.exe
   machine: Windows-10-10.0.18363-SP0

Python dependencies:
          pip: 21.2.2
   setuptools: 57.4.0
      sklearn: 1.0.1
        numpy: 1.21.1
        scipy: 1.6.3
       Cython: None
       pandas: 1.3.1
   matplotlib: 3.4.2
       joblib: 1.1.0
threadpoolctl: 2.2.0

Built with OpenMP: True
Top GitHub Comments
There are two issues here:
1. get_feature_names_out is incorrect when the inner estimator does not support feature_names_in_.
2. transform warns when the inner estimator does not support feature_names_in_.

I think the most lenient long-term solution is to “delegate when possible”: if the inner estimator does not support feature_names_in_, the meta-estimator gets the feature names and validates them in non-fit calls. WDYT?

@glemaitre will put a reproducible example together then
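A possible workaround (my suggestion, not something proposed in the thread): read the boolean support mask via get_support() and index the DataFrame's columns directly, since the mask does not depend on the inner estimator exposing feature_names_in_:
#workaround sketch: recover the selected column names from the support mask
selector = SelectFromModel(estimator=CatBoostRegressor(verbose=0)).fit(X, y)
mask = selector.get_support()             #boolean array, one entry per input column
selected_columns = list(X.columns[mask])  #original names of the kept features
print(selected_columns)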