KeyError raised when using pandas DataFrame in SelectFromModel.fit()

Describe the bug

When passing X to SelectFromModel.fit() where X is a pandas DataFrame, a KeyError is raised at

https://github.com/scikit-learn/scikit-learn/blob/16625450b58f555dc3955d223f0c3b64a5686984/sklearn/feature_selection/_from_model.py#L317

This happens because the DataFrame has no column with key 0; the expression X[0] used at that line only works with NumPy arrays, not DataFrames.
In version 1.0.2 this check was done with X.shape[1], which worked for both arrays and DataFrames.
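
To make the difference concrete, here is a minimal sketch comparing the two ways of counting features. The helper names n_features_by_indexing and n_features_by_shape are hypothetical and exist only for illustration; they are not part of scikit-learn. The first mirrors the len(X[0]) expression shown in the traceback, the second the X.shape[1] check described above.

```python
import numpy as np
import pandas as pd


def n_features_by_indexing(X):
    # Hypothetical helper mirroring the failing pattern: len(X[0]).
    # On a NumPy array, X[0] is the first row; on a DataFrame, X[0] is
    # a column lookup by label, which fails if no column is named 0.
    return len(X[0])


def n_features_by_shape(X):
    # Hypothetical helper mirroring the older check: X.shape[1].
    # Both NumPy arrays and DataFrames expose a (rows, columns) shape.
    return X.shape[1]


df = pd.DataFrame({"c": [3, 4, 15, 1], "d": [9, 4, 11, 0], "e": [5, 6, 7, 9]})
arr = df.to_numpy()

print(n_features_by_shape(arr), n_features_by_shape(df))
print(n_features_by_indexing(arr))

try:
    n_features_by_indexing(df)
except KeyError as exc:
    print("KeyError on DataFrame:", exc)
```

The shape-based count behaves the same for both input types, which is why the older check did not depend on how the columns were labeled.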

This is breaking our existing code.

Steps/Code to Reproduce

import logging

from mlxtend.classifier import LogisticRegression
from sklearn.feature_selection._from_model import SelectFromModel

import pandas as pd

df = pd.DataFrame(
    [
        ["c", 0, 3, 9, 5],
        ["d", 0, 4, 4, 6],
        ["d", 1, 15, 11, 7],
        ["c", 1, 1, 0, 9],
    ],
    columns=["a", "b", "c", "d", "e"],
)
target_col = "b"
df = df.drop(["a"], axis=1)
x = df[[x for x in df.columns if x != target_col]]
y = df[target_col]

try:
    SelectFromModel(LogisticRegression(), threshold="mean", max_features=2).fit(x, y)  # works in SKLearn v1.0.2, fails in 1.1.0
except KeyError:
    logging.exception("")

SelectFromModel(LogisticRegression(), threshold="mean", max_features=2).fit(x.values, y)
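
The last line of the reproducer shows that passing a NumPy array avoids the error. As a rough sketch of how that workaround might be wrapped on an affected version, the snippet below converts the DataFrame with to_numpy() before fitting and then maps the selection mask back to column names. Two assumptions to note: it uses sklearn.linear_model.LogisticRegression instead of the mlxtend classifier from the reproducer, and the variable names (features, selector, selected_columns) are illustrative only.

```python
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression  # assumption: swapped in for mlxtend's classifier

df = pd.DataFrame(
    [[0, 3, 9, 5], [0, 4, 4, 6], [1, 15, 11, 7], [1, 1, 0, 9]],
    columns=["b", "c", "d", "e"],
)
target_col = "b"
features = df.drop(columns=[target_col])
y = df[target_col]

# Fit on the underlying NumPy array so that positional indexing works.
selector = SelectFromModel(LogisticRegression(), threshold="mean", max_features=2)
selector.fit(features.to_numpy(), y.to_numpy())

# get_support() returns a boolean mask over the input columns, which
# lets us recover the column names that were lost in the conversion.
mask = selector.get_support()
selected_columns = list(features.columns[mask])
print(selected_columns)
```

The trade-off is that the estimator never sees the column names, so anything that relies on them has to be reconstructed from the mask as shown.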

Expected Results

No error raised.

Actual Results

Traceback (most recent call last):
  File "C:\Users\e68175\AppData\Local\JetBrains\PyCharm Community Edition 2021.3.3\plugins\python-ce\helpers\pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
  File "C:\Users\e68175\projects\datalab-pypf-2\venv-skl110\lib\site-packages\sklearn\feature_selection\_from_model.py", line 317, in fit
    max_val=len(X[0]),
  File "C:\Users\e68175\projects\datalab-pypf-2\venv-skl110\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\e68175\projects\datalab-pypf-2\venv-skl110\lib\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 0

Versions

Python dependencies:
      sklearn: 1.1.0
          pip: 21.2.4
   setuptools: 58.1.0
        numpy: 1.21.6
        scipy: 1.8.0
       Cython: 0.29.28
       pandas: 1.4.2
   matplotlib: 3.5.2
       joblib: 1.1.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\Users\e68175\projects\datalab-pypf-2\venv-skl110\Lib\site-packages\numpy\.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
        version: 0.3.17
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8
       user_api: openmp
   internal_api: openmp
         prefix: vcomp
       filepath: C:\Users\e68175\projects\datalab-pypf-2\venv-skl110\Lib\site-packages\sklearn\.libs\vcomp140.dll
        version: None
    num_threads: 8
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\Users\e68175\projects\datalab-pypf-2\venv-skl110\Lib\site-packages\scipy\.libs\libopenblas.XWYDX2IKJW2NMTWSFYNGFUWKQU3LYTCZ.gfortran-win_amd64.dll
        version: 0.3.17
threading_layer: pthreads
   architecture: Haswell
    num_threads: 8

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
kchawla-pi commented, May 17, 2022

Couldn’t risk pissing any of you off with a half assed issue.
You know where I live.

1 reaction
ogrisel commented, May 17, 2022

BTW, thanks @kchawla-pi for the minimal reproducer, it is much appreciated.
