ColumnTransformer: integer column index in dataframes unexpected behaviour and error (column selectors vs _get_column_indices)
Describe the bug
Unexpected behaviour when a DataFrame's columns are labelled with integers that do not match the natural positional ordering [0, 1, …].
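import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import Normalizer
select_int = make_column_selector(dtype_include=np.int_)
ct = ColumnTransformer([('t2', Normalizer(norm="l1"), select_int)])
df1 = pd.DataFrame({'1': [1, 2, 3], '2': [9, 8, 7]})  # string column labels
df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})  # integer column labels
ct.fit_transform(df1) # OK
ct.fit_transform(df2) # IndexError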
The only difference between df1 and df2 is the type of the column index. In my opinion, the results for these dataframes should be the same, but an error is raised for the latter.
As far as I can see, the problem stems from a semantic ambiguity as to when to use iloc-based indexing vs loc-based indexing. In _get_column_indices (L382) this decision is based on the type of the index and not on the type of the array. Whichever criterion is chosen, if it were followed consistently in the column selectors, the error would probably be avoided.
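To make the ambiguity concrete, here is a minimal pandas-only sketch. It does not call scikit-learn; it only assumes, as described above, that the column selector returns column labels and that integer values are then treated as positions. The variable names are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({1: [1, 2, 3], 2: [9, 8, 7]})

# What a label-returning selector would hand back: labels, not positions.
labels = list(df2.columns)  # [1, 2]

# Label-based lookup finds both columns.
by_label = df2.loc[:, labels]

# Position-based lookup with the same integers fails: there is no position 2.
try:
    df2.iloc[:, labels]
    positional_ok = True
except IndexError:
    positional_ok = False

# Translating labels to positions first removes the ambiguity.
positions = df2.columns.get_indexer(labels).tolist()  # [0, 1]
by_position = df2.iloc[:, positions]
```

With string labels (as in df1) the integers never appear, so the positional interpretation is never triggered; that would explain why only df2 fails.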
Steps/Code to Reproduce
(See above)
Expected Results
(See above)
Actual Results
(See above)
Versions
Python dependencies:
pip: 22.0.3
setuptools: 60.8.1
sklearn: 1.1.dev0
numpy: 1.22.2
scipy: 1.8.0
Cython: 0.29.27
pandas: 1.3.5
matplotlib: 3.5.0
joblib: 1.1.0
threadpoolctl: 3.1.0
commit b28c5bba66529217ceedd497201a684e5d35b73c (upstream/main, origin/main, origin/HEAD, main)
Author: Thomas J. Fan
Date: Tue Feb 15 11:46:54 2022 -0500
FIX DummyRegressor overriding constant (#22486)
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:6 (6 by maintainers)
Top GitHub Comments
Yup! That is the goal. The other benefit of use_index_directly is that it allows for using pandas' MultiIndex: https://github.com/scikit-learn/scikit-learn/issues/13781

Selecting columns based on two different types is not well supported and would require a design discussion. It also breaks backward compatibility.
If this is a major issue, I think we could add a use_index_directly parameter to ColumnTransformer that turns off these semantics and selects from the column directly. For example:
What do you think?