
SelectFromModel function in scikit 1.0.1 does not work properly with catboost and caching

See original GitHub issue

Describe the bug

Problem: The SelectFromModel class in scikit-learn 1.0.1 does not work properly with CatBoost: it appears to drop the column names, so get_feature_names_out does not return the proper column names. This leads to additional problems when combining CatBoost and scikit-learn in a pipeline with caching during hyperparameter optimization. For other models it works well.

Here is a reproducible example, with the error message from caching below it:

Steps/Code to Reproduce

from sklearn.svm import LinearSVR
from sklearn.datasets import load_iris
from catboost import CatBoostRegressor
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel

##################################################
#get the data and assign column names
##################################################

iris = load_iris()
X = pd.DataFrame(data=np.c_[iris['data']], columns=["a", "b", "c", "d"])
y = iris.target

##########################################################
##test with catboost
##########################################################

#train the selector
selector = SelectFromModel(estimator=CatBoostRegressor())
selector.fit(X, y)
#show the selected columns
selector.get_feature_names_out()
#array(['x2', 'x3'], dtype=object) instead of "c" and "d"
#the original column names have been replaced with the generated names x2 and x3

#if the selector is applied
selector.transform(X.iloc[1:2,:])
#I get the following warning
#X has feature names, but SelectFromModel was fitted without feature names

##############################################################
##Now the same with LinearSVR, here it works
##############################################################

#train the selector
selector = SelectFromModel(estimator=LinearSVR())
selector.fit(X, y)

#the proper names are being returned
selector.get_feature_names_out()
#array(['d'], dtype=object) and "d" is a proper column name

#apply the selector
selector.transform(X.iloc[1:2,:])
#no warning is given

Error message when using in hyperparameter tuning with caching

File "C:\LocalTools\Miniconda\envs\my-rdkit-env\lib\site-packages\sklearn\pipeline.py", line 647, in transform
    Xt = transform.transform(Xt)
  File "C:\LocalTools\Miniconda\envs\my-rdkit-env\lib\site-packages\sklearn\feature_selection\_base.py", line 83, in transform
    X = self._validate_data(
  File "C:\LocalTools\Miniconda\envs\my-rdkit-env\lib\site-packages\sklearn\base.py", line 580, in _validate_data
    self._check_n_features(X, reset=reset)
  File "C:\LocalTools\Miniconda\envs\my-rdkit-env\lib\site-packages\sklearn\base.py", line 395, in _check_n_features
    raise ValueError(
ValueError: X has 290 features, but SelectFromModel is expecting 0 features as input.

Actual Results

Because the proper column names are not returned, the cached transformer apparently cannot match the feature names, and the pipeline fails with the error above.
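Until this is fixed, a possible workaround (a sketch, not an official recipe) is to bypass get_feature_names_out and read the boolean support mask directly, recovering the names from the DataFrame itself. Ridge is used here as a stand-in so the snippet runs without catboost; the same mask-based approach applies regardless of whether the inner estimator records feature_names_in_.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge

# build the same DataFrame as in the repro above
iris = load_iris()
X = pd.DataFrame(iris["data"], columns=["a", "b", "c", "d"])
y = iris.target

selector = SelectFromModel(estimator=Ridge()).fit(X, y)

# get_support() returns a boolean mask over the input columns,
# so the original names can be recovered from the DataFrame
# without relying on the inner estimator's feature_names_in_:
selected = X.columns[selector.get_support()]
print(list(selected))
```

This only sidesteps the naming problem; it does not help with the caching error inside a Pipeline, where transform itself performs the failing feature-name validation.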

Versions

System:
    python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 06:25:23) [MSC v.1916 64 bit (AMD64)]
    executable: C:\LocalTools\Miniconda\envs\my-rdkit-env\python.exe
    machine: Windows-10-10.0.18363-SP0

Python dependencies:
    pip: 21.2.2
    setuptools: 57.4.0
    sklearn: 1.0.1
    numpy: 1.21.1
    scipy: 1.6.3
    Cython: None
    pandas: 1.3.1
    matplotlib: 3.4.2
    joblib: 1.1.0
    threadpoolctl: 2.2.0

Built with OpenMP: True

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

2 reactions
thomasjpfan commented, Dec 11, 2021

There are two issues here:

  1. get_feature_names_out is incorrect when the inner estimator does not support feature_names_in_.
  2. transform warns when the inner estimator does not support feature_names_in_.

I think the most lenient long-term solution is to “delegate when possible”: if the inner estimator does not support feature_names_in_, the meta-estimator will capture the feature names itself and validate them in non-fit calls. WDYT?
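The “delegate when possible” idea could be sketched roughly as follows. This is my reading of the proposal, not scikit-learn's actual implementation; the class FeatureNameDelegatingSelector and its constructor arguments are hypothetical.

```python
import numpy as np

class FeatureNameDelegatingSelector:
    """Illustration only: prefer the inner estimator's recorded feature
    names, and fall back to names the meta-estimator captured at fit time."""

    def __init__(self, inner, support_mask, seen_names):
        self.inner = inner                       # fitted inner estimator
        self.support_mask = np.asarray(support_mask)   # boolean selection mask
        self.seen_names = np.asarray(seen_names, dtype=object)  # names seen at fit

    def get_feature_names_out(self):
        # Delegate when possible: use the inner estimator's names...
        names = getattr(self.inner, "feature_names_in_", None)
        if names is None:
            # ...otherwise fall back to the names the meta-estimator
            # captured itself, instead of generating x0, x1, ...
            names = self.seen_names
        return np.asarray(names, dtype=object)[self.support_mask]

class _Unnamed:
    # stand-in for an inner estimator (like CatBoostRegressor here)
    # that does not expose feature_names_in_
    pass

sel = FeatureNameDelegatingSelector(_Unnamed(), [False, True, True, False],
                                    ["a", "b", "c", "d"])
print(sel.get_feature_names_out())  # falls back to the captured names
```

The fallback branch is what restores "b" and "c" style output in the CatBoost case, while estimators that do record feature_names_in_ keep the current delegating behaviour.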

0 reactions
ThomasWolf0701 commented, Feb 1, 2022

@glemaitre will put a reproducible example together then
