
Missing values break feature selection with pipeline estimator


Describe the bug

Feature selectors were adapted in #11635 to accept missing values when their underlying estimator does, but this does not work when the estimator is a pipeline. The traceback (below) makes the reason apparent: the fix in the above-mentioned PR checks the underlying estimator's "allow_nan" tag, but Pipeline does not inherit that tag from any of its steps (not even the first), so it keeps the default of False.

Discovered at https://stackoverflow.com/q/69883401/10495893
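The tag mismatch is easy to observe directly: the pipeline on its own copes with NaN (the imputer fills the holes before the classifier sees the data), yet wrapping the same pipeline in a selector trips input validation. A minimal sketch (the NaN pattern and max_iter value here are illustrative choices, not from the original report):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
X[::10, 0] = np.nan  # knock out some entries in the first column

pipe = make_pipeline(SimpleImputer(), LogisticRegression(max_iter=5000))

# The pipeline by itself handles NaN fine: the imputer runs first.
pipe.fit(X, y)

# Wrapping the same pipeline in a selector fails, because the selector
# consults the pipeline's tags rather than the first step's.
try:
    RFE(estimator=pipe).fit(X, y)
    selector_fit_failed = False
except ValueError:
    selector_fit_failed = True
```

The bare `pipe.fit` succeeds while the `RFE(...).fit` on identical data raises, which is exactly the inconsistency the issue describes.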

Steps/Code to Reproduce

from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
nan_inds = np.random.rand(*X.shape) < 0.1  # mark ~10% of entries as missing
X[nan_inds] = np.nan

pipe = make_pipeline(
    SimpleImputer(),
    LogisticRegression(),
)

fs = RFE(estimator=pipe)
fs.fit(X, y)

The same error occurs with SelectKBest and SequentialFeatureSelector. A similar error occurs in SelectFromModel, but at transform time rather than fit time.
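Until this is resolved, a practical workaround is to move the imputer outside the selector, so the selector's own validation only ever sees finite data. A sketch (n_features_to_select=10 and max_iter=5000 are arbitrary illustrative choices; note that imputing before selection is not perfectly equivalent to imputing inside the selected estimator):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
X[::10, 0] = np.nan  # introduce some missing values

# Impute *before* selecting, instead of inside the selector's estimator:
workaround = make_pipeline(
    SimpleImputer(),
    RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10),
)
workaround.fit(X, y)  # no ValueError: RFE receives imputed, finite data
```

This also sidesteps the separate problem that RFE's default importance_getter expects `coef_` or `feature_importances_` directly on its estimator, which a pipeline does not expose.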

Expected Results

No error is thrown

Actual Results


ValueError                                Traceback (most recent call last)
<ipython-input-11-2744b13fcc5c> in <module>()
      1 fs = RFE(estimator=pipe)
----> 2 fs.fit(X, y)

5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, **fit_params)
    220             Fitted estimator.
    221         """
--> 222         return self._fit(X, y, **fit_params)
    223 
    224     def _fit(self, X, y, step_score=None, **fit_params):

/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in _fit(self, X, y, step_score, **fit_params)
    235             ensure_min_features=2,
    236             force_all_finite=not tags.get("allow_nan", True),
--> 237             multi_output=True,
    238         )
    239         error_msg = (

/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    574                 y = check_array(y, **check_y_params)
    575             else:
--> 576                 X, y = check_X_y(X, y, **check_params)
    577             out = X, y
    578 

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    966         ensure_min_samples=ensure_min_samples,
    967         ensure_min_features=ensure_min_features,
--> 968         estimator=estimator,
    969     )
    970 

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    790 
    791         if force_all_finite:
--> 792             _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    793 
    794     if ensure_min_samples > 0:

/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    114             raise ValueError(
    115                 msg_err.format(
--> 116                     type_err, msg_dtype if msg_dtype is not None else X.dtype
    117                 )
    118             )

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
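The traceback bottoms out in check_array: because the pipeline's allow_nan tag is False, force_all_finite stays truthy and any NaN in X raises. A minimal illustration of just that validation step (only the default path is shown; the parameter that relaxes the check was renamed in later scikit-learn releases, so it is described in prose only):

```python
import numpy as np
from sklearn.utils.validation import check_array

X = np.array([[1.0, np.nan],
              [3.0, 4.0]])

# Default validation rejects any non-finite value outright.
try:
    check_array(X)
    nan_rejected = False
except ValueError:
    nan_rejected = True
```

When a selector's estimator does advertise allow_nan, the selectors instead invoke this validation in a NaN-permissive mode, which lets NaN through while still rejecting infinities.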

Versions

System:
    python: 3.7.12 (default, Sep 10 2021, 00:21:48)  [GCC 7.5.0]
executable: /usr/bin/python3
   machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 21.1.3
   setuptools: 57.4.0
      sklearn: 1.0.1
        numpy: 1.19.5
        scipy: 1.4.1
       Cython: 0.29.24
       pandas: 1.1.5
   matplotlib: 3.2.2
       joblib: 1.1.0
threadpoolctl: 3.0.0

Built with OpenMP: True

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions · jnothman commented, Nov 25, 2021

Yes, it might be simplest for RFE to always permit NaN; RFE only really needs to validate that X has columns that can be indexed.

However, I don’t think it’s right that Pipeline defaults to an allow_nan: False state. It should be permissive until proven otherwise.

1 reaction · bmreiniger commented, Nov 22, 2021

Is there a reason not to skip validation entirely and leave it to the estimator?

Getting a pipeline (or other composites; consider a large column transformer in this situation!) to figure out whether it can handle missing values seems tricky. Some transformers might accept NaNs but pass them along, until an imputer later fills them, or the final predictor turns out to accept missing values itself…
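The "accept but pass along" case described in this comment is concrete: StandardScaler, for example, ignores NaNs when computing its statistics and passes them through unchanged, leaving any later imputer to fill them. A small sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 6.0]])

# fit_transform computes mean/std while ignoring the NaN,
# and the NaN itself survives the transform untouched.
Xt = StandardScaler().fit_transform(X)
```

So a composite's ability to "handle" NaN depends on the interplay of all its steps, which is exactly why inferring a single allow_nan answer for a pipeline is hard.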


