Missing values break feature selection with pipeline estimator
Describe the bug
Feature selectors were adapted in #11635 to allow missing values when their underlying estimator does, but that fix does not work if the estimator is a pipeline. The traceback (below) makes the reason apparent: the fix in the above-mentioned PR checks the "allow_nan" tag of the underlying estimator, but Pipeline doesn't inherit that tag from any of its steps (first or otherwise), and so keeps its default of False.
Discovered at https://stackoverflow.com/q/69883401/10495893
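The mismatch is easy to see by inspecting the estimator tags directly. A minimal sketch (_safe_tags is a private scikit-learn helper as of 1.0 and may change between versions):

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.utils._tags import _safe_tags

print(_safe_tags(SimpleImputer(), key="allow_nan"))  # True: the imputer accepts NaN
pipe = make_pipeline(SimpleImputer(), LogisticRegression())
print(_safe_tags(pipe, key="allow_nan"))             # False: Pipeline keeps its default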
Steps/Code to Reproduce
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
nan_inds = np.random.rand(*X.shape) < 0.1  # mask roughly 10% of entries as missing
X[nan_inds] = np.nan
pipe = make_pipeline(
    SimpleImputer(),
    LogisticRegression(),
)
fs = RFE(estimator=pipe)
fs.fit(X, y)
The same error occurs with SelectKBest and SequentialFeatureSelector. A similar error occurs in SelectFromModel, but at transform time rather than at fit (sketched below).
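For illustration, a sketch of the SelectFromModel variant, reusing pipe, X, and y from the snippet above:

from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(estimator=pipe)
sfm.fit(X, y)     # succeeds: fit merely delegates to the pipeline, whose imputer handles the NaNs
sfm.transform(X)  # raises the same ValueError during input validation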
Expected Results
No error is thrown
Actual Results
ValueError Traceback (most recent call last)
<ipython-input-11-2744b13fcc5c> in <module>()
1 fs = RFE(estimator=pipe)
----> 2 fs.fit(X, y)
5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, **fit_params)
220 Fitted estimator.
221 """
--> 222 return self._fit(X, y, **fit_params)
223
224 def _fit(self, X, y, step_score=None, **fit_params):
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in _fit(self, X, y, step_score, **fit_params)
235 ensure_min_features=2,
236 force_all_finite=not tags.get("allow_nan", True),
--> 237 multi_output=True,
238 )
239 error_msg = (
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
574 y = check_array(y, **check_y_params)
575 else:
--> 576 X, y = check_X_y(X, y, **check_params)
577 out = X, y
578
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
966 ensure_min_samples=ensure_min_samples,
967 ensure_min_features=ensure_min_features,
--> 968 estimator=estimator,
969 )
970
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
790
791 if force_all_finite:
--> 792 _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
793
794 if ensure_min_samples > 0:
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
114 raise ValueError(
115 msg_err.format(
--> 116 type_err, msg_dtype if msg_dtype is not None else X.dtype
117 )
118 )
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Versions
System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0.1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True
Top GitHub Comments
Yes, might be simplest for RFE to always permit NaN. RFE only really needs to validate that X has columns that can be indexed.
However, I don’t think it’s right that Pipeline defaults to an allow_nan: False state. It should be permissive until proven otherwise.
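Until that changes upstream, one possible workaround (a sketch, not an official API: it overrides private tag machinery that may change between releases, and it assumes the first pipeline step decides NaN tolerance) is to forward the tag manually, reusing X and y from the reproduction above:

from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.utils._tags import _safe_tags

class NaNPassthroughPipeline(Pipeline):
    def _more_tags(self):
        # Assumption: the first step (here an imputer) decides NaN tolerance
        return {"allow_nan": _safe_tags(self.steps[0][1], key="allow_nan")}

pipe = NaNPassthroughPipeline([("impute", SimpleImputer()), ("lr", LogisticRegression())])
# RFE must also be told where the coefficients live inside the pipeline
fs = RFE(estimator=pipe, importance_getter="named_steps.lr.coef_")
fs.fit(X, y)  # the imputer now fills the NaNs inside each internal fit

RFE reads its own allow_nan tag from the wrapped estimator, so reporting True on the pipeline is enough to make it skip the finiteness check.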
Is there a reason not to simply skip validation here and leave it to the estimator?

Trying to get a pipeline (and other composites; consider a large ColumnTransformer in this situation!) to figure out whether it can handle missing values seems tricky. Some transformers might accept NaNs but pass them along, until an imputer later fills them, or the final predictor may itself accept missing values…
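For instance, in this illustrative sketch no single step's tag answers the question for the composite as a whole:

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),      # tolerates NaN in fit and passes it through in transform
    SimpleImputer(),       # the NaNs are only filled here, two steps in
    LogisticRegression(),  # never sees NaN, and would reject it
)
# Whether the pipeline "allows NaN" depends on every step up to and
# including the imputer, not on the first or last step alone.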