ColumnTransformer behavior for negative column indexes

Description

The behavior of ColumnTransformer when negative integers are passed as column indexes is not clear.

Steps/Code to Reproduce

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.random.randn(2, 2)
X_categories = np.array([[1], [2]])
X = np.concatenate([X, X_categories], axis=1)

print('---- With negative index ----')
ohe = OneHotEncoder(categories='auto')
tf_1 = ColumnTransformer([('ohe', ohe, [-1])], remainder='passthrough')
print(tf_1.fit_transform(X))

print('---- With positive index ----')
tf_2 = ColumnTransformer([('ohe', ohe, [2])], remainder='passthrough')
print(tf_2.fit_transform(X))

Expected Results

The first transformer, tf_1, should either raise an error or give the same result as the second transformer, tf_2.

Actual Results

---- With negative index ----
[[ 1.          0.          0.10600662 -0.46707426  1.        ]
 [ 0.          1.         -1.33177629  2.29186299  2.        ]]
---- With positive index ----
[[ 1.          0.          0.10600662 -0.46707426]
 [ 0.          1.         -1.33177629  2.29186299]]

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:6 (5 by maintainers)

Top GitHub Comments

jorisvandenbossche commented, Jan 10, 2019

It is the validation of the remainder that is going wrong:

In [15]: tf_1._remainder
Out[15]: ('remainder', 'passthrough', [0, 1, 2])   <--- wrong

In [16]: tf_2._remainder
Out[16]: ('remainder', 'passthrough', [0, 1])

This is because the set operation here to get remaining_idx does not work with negative indices:

https://github.com/scikit-learn/scikit-learn/blob/354c8c3bc3e36c69021713da66e7fa2f6cb07756/sklearn/compose/_column_transformer.py#L298-L304
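To illustrate why a plain set difference misses negative indices, here is a minimal sketch in pure Python. It is not the scikit-learn source; the variable names are made up for illustration, and it only assumes that the remainder is computed as "all positional indices minus the selected ones".

```python
n_columns = 3
selected = [-1]  # the column handed to the transformer (the last column)

# Naive remainder: every positional index that was not selected.
# -1 is never a member of {0, 1, 2}, so nothing gets removed.
naive_remainder = sorted(set(range(n_columns)) - set(selected))
print(naive_remainder)  # [0, 1, 2]
```

The result matches the wrong `_remainder` shown above: column 2 is both one-hot encoded and passed through.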

Maybe we should convert the negative indices to positive ones in _get_column_indices?
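A rough sketch of what that conversion could look like, outside of scikit-learn. The helper names `normalize_indices` and `remaining_columns` are hypothetical and are not part of the library; the sketch only assumes integer column indices and a known number of columns.

```python
def normalize_indices(indices, n_columns):
    """Hypothetical helper: map negative indices to their non-negative equivalents."""
    normalized = []
    for idx in indices:
        # Reject indices that would be out of bounds for an array with n_columns columns.
        if not -n_columns <= idx < n_columns:
            raise IndexError(f"column index {idx} is out of bounds for {n_columns} columns")
        normalized.append(idx % n_columns)
    return normalized


def remaining_columns(selected, n_columns):
    """Hypothetical helper: indices not claimed by any transformer, after normalization."""
    used = set(normalize_indices(selected, n_columns))
    return sorted(set(range(n_columns)) - used)


print(remaining_columns([-1], 3))  # [0, 1]
print(remaining_columns([2], 3))   # [0, 1]
```

With the indices normalized first, the negative and positive forms yield the same remainder, which is the behavior the issue asks for.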

jnothman commented, Jan 9, 2019

I think we should allow negative indices, if only because we are supporting various other numpy indexing syntaxes and users would expect it. Current behaviour doesn’t look so good!
