Missing values break feature selection with pipeline estimator
Describe the bug
Feature selectors were adapted in #11635 to allow missing values when their underlying estimator does, but that fix does not work if the estimator is a pipeline. The traceback (below) makes the reason apparent: the fix in the above-mentioned PR checks the "allow_nan" tag of the underlying estimator, but Pipeline doesn't inherit that tag from any of its steps (first or otherwise), and so keeps its default of False.
Discovered at https://stackoverflow.com/q/69883401/10495893
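The mismatch is easy to see by inspecting the estimator tags directly. A minimal sketch (_safe_tags is a private scikit-learn helper as of 1.0 and may change between versions):

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.utils._tags import _safe_tags

print(_safe_tags(SimpleImputer(), key="allow_nan"))  # True: the imputer accepts NaN
pipe = make_pipeline(SimpleImputer(), LogisticRegression())
print(_safe_tags(pipe, key="allow_nan"))             # False: Pipeline keeps its default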
Steps/Code to Reproduce
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, SequentialFeatureSelector
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
nan_inds = np.random.rand(*X.shape) < 0.1  # mask roughly 10% of entries as missing
X[nan_inds] = np.nan
pipe = make_pipeline(
    SimpleImputer(),
    LogisticRegression(),
)
fs = RFE(estimator=pipe)
fs.fit(X, y)
The same error occurs with SelectKBest and SequentialFeatureSelector. A similar error occurs in SelectFromModel, but at transform time rather than at fit (sketched below).
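For illustration, a sketch of the SelectFromModel variant, reusing pipe, X, and y from the snippet above:

from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(estimator=pipe)
sfm.fit(X, y)     # succeeds: fit merely delegates to the pipeline, whose imputer handles the NaNs
sfm.transform(X)  # raises the same ValueError during input validation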
Expected Results
No error is thrown
Actual Results
ValueError Traceback (most recent call last)
<ipython-input-11-2744b13fcc5c> in <module>()
1 fs = RFE(estimator=pipe)
----> 2 fs.fit(X, y)
5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in fit(self, X, y, **fit_params)
220 Fitted estimator.
221 """
--> 222 return self._fit(X, y, **fit_params)
223
224 def _fit(self, X, y, step_score=None, **fit_params):
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_rfe.py in _fit(self, X, y, step_score, **fit_params)
235 ensure_min_features=2,
236 force_all_finite=not tags.get("allow_nan", True),
--> 237 multi_output=True,
238 )
239 error_msg = (
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
574 y = check_array(y, **check_y_params)
575 else:
--> 576 X, y = check_X_y(X, y, **check_params)
577 out = X, y
578
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
966 ensure_min_samples=ensure_min_samples,
967 ensure_min_features=ensure_min_features,
--> 968 estimator=estimator,
969 )
970
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
790
791 if force_all_finite:
--> 792 _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
793
794 if ensure_min_samples > 0:
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
114 raise ValueError(
115 msg_err.format(
--> 116 type_err, msg_dtype if msg_dtype is not None else X.dtype
117 )
118 )
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Versions
System:
python: 3.7.12 (default, Sep 10 2021, 00:21:48) [GCC 7.5.0]
executable: /usr/bin/python3
machine: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Python dependencies:
pip: 21.1.3
setuptools: 57.4.0
sklearn: 1.0.1
numpy: 1.19.5
scipy: 1.4.1
Cython: 0.29.24
pandas: 1.1.5
matplotlib: 3.2.2
joblib: 1.1.0
threadpoolctl: 3.0.0
Built with OpenMP: True
Top GitHub Comments
Yes, might be simplest for RFE to always permit NaN. RFE only really needs to validate that X has columns that can be indexed.
However, I don’t think it’s right that Pipeline defaults to an allow_nan: False state. It should be permissive until proven otherwise.
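Until that changes upstream, one possible workaround (a sketch, not an official API: it overrides private tag machinery that may change between releases, and it assumes the first pipeline step decides NaN tolerance) is to forward the tag manually, reusing X and y from the reproduction above:

from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.utils._tags import _safe_tags

class NaNPassthroughPipeline(Pipeline):
    def _more_tags(self):
        # Assumption: the first step (here an imputer) decides NaN tolerance
        return {"allow_nan": _safe_tags(self.steps[0][1], key="allow_nan")}

pipe = NaNPassthroughPipeline([("impute", SimpleImputer()), ("lr", LogisticRegression())])
# RFE must also be told where the coefficients live inside the pipeline
fs = RFE(estimator=pipe, importance_getter="named_steps.lr.coef_")
fs.fit(X, y)  # the imputer now fills the NaNs inside each internal fit

RFE reads its own allow_nan tag from the wrapped estimator, so reporting True on the pipeline is enough to make it skip the finiteness check.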
Is there a reason not to simply skip validation here and leave it to the estimator?

Trying to get a pipeline (and other composites; consider a large ColumnTransformer in this situation!) to figure out whether it can handle missing values seems tricky. Some transformers might accept NaNs but pass them along, until an imputer later fills them, or the final predictor may itself accept missing values…
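For instance, in this illustrative sketch no single step's tag answers the question for the composite as a whole:

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(
    StandardScaler(),      # tolerates NaN in fit and passes it through in transform
    SimpleImputer(),       # the NaNs are only filled here, two steps in
    LogisticRegression(),  # never sees NaN, and would reject it
)
# Whether the pipeline "allows NaN" depends on every step up to and
# including the imputer, not on the first or last step alone.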