ColumnTransformer fails with mixed input types

Description

I found this issue while working with OneHotEncoder, whose output is a sparse matrix by default.

To reproduce the issue with mock data, I am forcing sparse_threshold=1.0. The issue does not occur if the inputs of ColumnTransformer are all sparse matrices or all dense matrices, nor if the output is a dense matrix. It happens when the inputs are mixed (a sparse matrix and a dense matrix, for example) and the output is sparse.

Note below that one of the inputs has columns of mixed types. This is key to reproducing the issue: if the columns shared a single type, it would work as expected.
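The role of the mixed-type columns can be sketched outside of ColumnTransformer. The snippet below is only an illustration under stated assumptions: the small `onehot` block is a hypothetical stand-in for the OneHotEncoder output, and the float cast is one possible workaround, not necessarily what scikit-learn does internally.

```python
import numpy as np
import pandas as pd
from scipy import sparse

df = pd.DataFrame({"numerical": [1, 2], "boolean": [True, False]})

# pandas has no common NumPy dtype for an int64 column and a bool column,
# so the selected block comes back as an object array.
mixed = df[["numerical", "boolean"]].to_numpy()
print(mixed.dtype)

# Hypothetical stand-in for the sparse OneHotEncoder output.
onehot = sparse.csr_matrix(np.eye(2))

try:
    sparse.hstack([onehot, mixed])
except Exception as exc:  # exact exception type may vary across SciPy versions
    print("sparse.hstack failed:", exc)

# Casting the dense block to a numeric dtype first lets the stack succeed.
stacked = sparse.hstack([onehot, mixed.astype(np.float64)]).tocsr()
print(stacked.shape)
```

If both passthrough columns were numeric, the dense block would already have a numeric dtype and no cast would be needed, which matches the observation that shared types work as expected.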

Maybe related: scikit-learn-contrib/sklearn-pandas#51

Steps/Code to Reproduce

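import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([['a',1,True,0],['b',2,False,0]],
                  columns=['categorical', 'numerical', 'boolean', 'shall_not_pass'])

# (columns, transformer) tuple order, as accepted by 0.20.dev0
ct = make_column_transformer(
    (['categorical'], OneHotEncoder()),
    (['numerical', 'boolean'], 'passthrough')
)

# Force a sparse output so the sparse and dense blocks are stacked sparsely
ct.sparse_threshold = 1.0

ct.fit_transform(df)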

Expected Results

<2x4 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

Actual Results

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-ee6fd5a237c5> in <module>()
----> 1 ct.fit_transform(df)

/home/matias/scikit-learn/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    447         self._validate_output(Xs)
    448 
--> 449         return self._hstack(list(Xs))
    450 
    451     def transform(self, X):

/home/matias/scikit-learn/sklearn/compose/_column_transformer.py in _hstack(self, Xs)
    488         """
    489         if self.sparse_output_:
--> 490             return sparse.hstack(Xs).tocsr()
    491         else:
    492             Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]

/home/matias/.pyenv/versions/3.5.2/lib/python3.5/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    462 
    463     """
--> 464     return bmat([blocks], format=format, dtype=dtype)
    465 
    466 

/home/matias/.pyenv/versions/3.5.2/lib/python3.5/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    598     if dtype is None:
    599         all_dtypes = [blk.dtype for blk in blocks[block_mask]]
--> 600         dtype = upcast(*all_dtypes) if all_dtypes else None
    601 
    602     row_offsets = np.append(0, np.cumsum(brow_lengths))

/home/matias/.pyenv/versions/3.5.2/lib/python3.5/site-packages/scipy/sparse/sputils.py in upcast(*args)
     50             return t
     51 
---> 52     raise TypeError('no supported conversion for types: %r' % (args,))
     53 
     54 

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

Versions

Linux-4.17.11-arch1-x86_64-with-arch-Arch-Linux
Python 3.5.2 (default, Nov 17 2016, 21:45:04) 
[GCC 6.2.1 20160830]
NumPy 1.15.1
SciPy 1.1.0
Scikit-Learn 0.20.dev0

Top GitHub Comments

jnothman commented, Sep 19, 2018

I agree that these ColumnTransformer limitations could be addressed in 0.20.1.

chkoar commented, Feb 23, 2019

This issue was fixed in #12200
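To check the behaviour on a newer installation, the reproduction can be rewritten roughly as follows. This is a sketch that assumes a scikit-learn release that includes the fix and uses the (transformer, columns) tuple order expected by released versions; passing sparse_threshold as a keyword argument replaces the attribute assignment from the report.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [["a", 1, True, 0], ["b", 2, False, 0]],
    columns=["categorical", "numerical", "boolean", "shall_not_pass"],
)

# Released versions expect (transformer, columns) tuples.
ct = make_column_transformer(
    (OneHotEncoder(), ["categorical"]),
    ("passthrough", ["numerical", "boolean"]),
    sparse_threshold=1.0,  # force a sparse result, as in the report
)

out = ct.fit_transform(df)
print(sparse.issparse(out), out.shape)
```

The unlisted shall_not_pass column is dropped by default, so the result should have four columns: two from the one-hot encoding and two passed through.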
