ColumnTransformer fails with mixed input types

Description

I found this issue while working with OneHotEncoder, whose output is a sparse matrix by default.

To reproduce the issue with mock data, I am forcing sparse_threshold=1.0. The issue does not occur if the inputs of ColumnTransformer are all sparse matrices or all dense matrices, nor if the output is a dense matrix. It happens when the inputs are mixed (a sparse matrix and a dense matrix, for example) and the output is sparse.

Note below that one of the inputs has columns of mixed types. This is key to reproducing the issue: if the columns shared a type, it would work as expected.
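The role of the mixed column types can be checked outside scikit-learn. The sketch below is only an illustration, not part of the original report: it selects the same columns from the mock DataFrame and inspects the dtype of the resulting NumPy array. The variable names `mixed` and `same` are made up for this example.

```python
import pandas as pd

# Same mock data as in the reproduction steps.
df = pd.DataFrame([['a', 1, True, 0], ['b', 2, False, 0]],
                  columns=['categorical', 'numerical', 'boolean', 'shall_not_pass'])

# An integer column next to a boolean column has no common numeric dtype
# in pandas, so the extracted array falls back to dtype object.
mixed = df[['numerical', 'boolean']].values
print(mixed.dtype)  # object

# Two integer columns share a type, so the array keeps an integer dtype.
same = df[['numerical', 'shall_not_pass']].values
print(same.dtype)
```

The object dtype of `mixed` matches the `dtype('O')` that appears in the error message of the traceback.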

Maybe related: scikit-learn-contrib/sklearn-pandas#51

Steps/Code to Reproduce

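import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([['a', 1, True, 0], ['b', 2, False, 0]],
                  columns=['categorical', 'numerical', 'boolean', 'shall_not_pass'])

# Tuples are written as (columns, transformer), as in the 0.20.dev0 build used here.
ct = make_column_transformer(
    (['categorical'], OneHotEncoder()),
    (['numerical', 'boolean'], 'passthrough')
)

# Force a sparse result even though part of the output is dense.
ct.sparse_threshold = 1.0

ct.fit_transform(df)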

Expected Results

<2x4 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

Actual Results

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-ee6fd5a237c5> in <module>()
----> 1 ct.fit_transform(df)

/home/matias/scikit-learn/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
    447         self._validate_output(Xs)
    448 
--> 449         return self._hstack(list(Xs))
    450 
    451     def transform(self, X):

/home/matias/scikit-learn/sklearn/compose/_column_transformer.py in _hstack(self, Xs)
    488         """
    489         if self.sparse_output_:
--> 490             return sparse.hstack(Xs).tocsr()
    491         else:
    492             Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]

/home/matias/.pyenv/versions/3.5.2/lib/python3.5/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    462 
    463     """
--> 464     return bmat([blocks], format=format, dtype=dtype)
    465 
    466 

/home/matias/.pyenv/versions/3.5.2/lib/python3.5/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    598     if dtype is None:
    599         all_dtypes = [blk.dtype for blk in blocks[block_mask]]
--> 600         dtype = upcast(*all_dtypes) if all_dtypes else None
    601 
    602     row_offsets = np.append(0, np.cumsum(brow_lengths))

/home/matias/.pyenv/versions/3.5.2/lib/python3.5/site-packages/scipy/sparse/sputils.py in upcast(*args)
     50             return t
     51 
---> 52     raise TypeError('no supported conversion for types: %r' % (args,))
     53 
     54 

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
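The last frames show SciPy being asked to stack a float64 block with an object-dtype block. As a rough sketch of a possible workaround at the SciPy level (an assumption for illustration, not the change that was eventually merged into scikit-learn), the object block can be cast to a numeric dtype before stacking. The arrays `encoded` and `passthrough` are hypothetical stand-ins for the two transformer outputs.

```python
import numpy as np
from scipy import sparse

# Stand-in for the OneHotEncoder output: a 2x2 sparse float64 matrix.
encoded = sparse.csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0]]))

# Stand-in for the passthrough columns: mixed int/bool values, dtype object.
passthrough = np.array([[1, True], [2, False]], dtype=object)

# Cast the object block to float64 and make it sparse before stacking,
# so every block handed to hstack has a supported numeric dtype.
numeric_block = sparse.csr_matrix(passthrough.astype(np.float64))
stacked = sparse.hstack([encoded, numeric_block]).tocsr()

print(stacked.shape, stacked.dtype, stacked.nnz)
```

With this mock data the result is a 2x4 float64 CSR matrix with 5 stored elements, which is what the Expected Results section describes.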

Versions

Linux-4.17.11-arch1-x86_64-with-arch-Arch-Linux
Python 3.5.2 (default, Nov 17 2016, 21:45:04) 
[GCC 6.2.1 20160830]
NumPy 1.15.1
SciPy 1.1.0
Scikit-Learn 0.20.dev0

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

jnothman commented, Sep 19, 2018 (1 reaction)

I agree that these ColumnTransformer limitations could be addressed in 0.20.1.

chkoar commented, Feb 23, 2019 (0 reactions)

This issue was fixed in #12200.
