ColumnTransformer behavior for negative column indexes

Description

The behavior of ColumnTransformer when negative integers are passed as column indexes is not clear.

Steps/Code to Reproduce

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.random.randn(2, 2)
X_categories = np.array([[1], [2]])
X = np.concatenate([X, X_categories], axis=1)

print('---- With negative index ----')
ohe = OneHotEncoder(categories='auto')
tf_1 = ColumnTransformer([('ohe', ohe, [-1])], remainder='passthrough')
print(tf_1.fit_transform(X))

print('---- With positive index ----')
tf_2 = ColumnTransformer([('ohe', ohe, [2])], remainder='passthrough')
print(tf_2.fit_transform(X))

Expected Results

The first transformer, tf_1, should either raise an error or give the same result as the second transformer, tf_2.

Actual Results

---- With negative index ----
[[ 1.          0.          0.10600662 -0.46707426  1.        ]
 [ 0.          1.         -1.33177629  2.29186299  2.        ]]
---- With positive index ----
[[ 1.          0.          0.10600662 -0.46707426]
 [ 0.          1.         -1.33177629  2.29186299]]

Issue Analytics

Created 5 years ago
Comments: 6 (5 by maintainers)

Top GitHub Comments

jorisvandenbossche commented, Jan 10, 2019

It is the validation of the remainder that is going wrong:

In [15]: tf_1._remainder
Out[15]: ('remainder', 'passthrough', [0, 1, 2])   <--- wrong

In [16]: tf_2._remainder
Out[16]: ('remainder', 'passthrough', [0, 1])

This is because the set operation here to get remaining_idx does not work with negative indices:

https://github.com/scikit-learn/scikit-learn/blob/354c8c3bc3e36c69021713da66e7fa2f6cb07756/sklearn/compose/_column_transformer.py#L298-L304

Maybe we should convert the negative indices to positive ones in _get_column_indices?

jnothman commented, Jan 9, 2019

I think we should allow negative indices, if only because we are supporting various other numpy indexing syntaxes and users would expect it. Current behaviour doesn’t look so good!
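
The set-difference problem described in the comments above can be reproduced without scikit-learn. The following is a minimal sketch, not the library's actual code: the helper names normalize_indices and remaining_columns are hypothetical and exist only for illustration. It shows why a raw negative index leaves the remainder unchanged, and how mapping negative indices to their non-negative equivalents first (the fix suggested for _get_column_indices) gives the expected remainder.

```python
def normalize_indices(columns, n_columns):
    """Map negative positional indices to their non-negative equivalents.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    normalized = []
    for idx in columns:
        if not -n_columns <= idx < n_columns:
            raise IndexError(
                f"column index {idx} is out of range for {n_columns} columns"
            )
        # In Python, -1 % 3 == 2, so this wraps negative indices around.
        normalized.append(idx % n_columns)
    return normalized


def remaining_columns(used, n_columns):
    """Columns not claimed by any transformer, via a plain set difference."""
    return sorted(set(range(n_columns)) - set(used))


n_columns = 3

# Raw negative index: -1 is not a member of {0, 1, 2}, so nothing is removed
# and the last column ends up both transformed and passed through.
naive = remaining_columns([-1], n_columns)

# Normalized index: -1 becomes 2, so column 2 is excluded from the remainder.
fixed = remaining_columns(normalize_indices([-1], n_columns), n_columns)

print(naive)  # [0, 1, 2]
print(fixed)  # [0, 1]
```

The two printed lists correspond to the _remainder index lists reported for tf_1 and tf_2 above. Raising on out-of-range indices is a design choice of this sketch, not something the thread specifies.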
