Inconsistent column references with ColumnTransformer for text/numeric columns
#### Description
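A small two-column dataset with one text column and one numeric column requires inconsistent column notation in `ColumnTransformer`: the text column has to be given as a scalar, while the numeric column has to be given as a one-item list.
I found it raised here, but I found the errors quite confusing even after I had solved the issue.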
#### Steps/Code to Reproduce
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"],
                        "c": [1, 2]})

clmn = ColumnTransformer([
    ("tfidf", TfidfVectorizer(min_df=0), "a"),
    ("norm", Normalizer(norm='l1'), "c")  # errors
    # ("norm", Normalizer(norm='l1'), ["c"])  # code executes as expected
])

clmn.fit_transform(dataset)
```
#### Expected Results
```python
array([[0.44943642, 0.6316672 , 0.        , 0.        , 0.6316672 ,
        1.        ],
       [0.44943642, 0.        , 0.6316672 , 0.6316672 , 0.        ,
        1.        ]])
```
#### Actual Results
```
ValueError: 1D data passed to a transformer that expects 2D data. Try to specify the column selection as a list of one item instead of a scalar.
```
#### Versions
System:
- python: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
- executable: C:\Users\evanmiller\AppData\Local\Continuum\anaconda3\envs\capco\python.exe
- machine: Windows-10-10.0.16299-SP0

BLAS:
- macros:
- lib_dirs:
- cblas_libs: cblas

Python deps:
- pip: 19.1.1
- setuptools: 41.0.1
- sklearn: 0.20.3
- numpy: 1.16.4
- scipy: 1.3.0
- Cython: None
- pandas: 0.24.2

Is there any explanation as to why this happens? The single text column must have no brackets (`[`/`]`) around it, while the numeric column requires them.
If I understood more about why it’s happening, I’d be happy to write a more informative error message, if you think that’s the right call.
Evan

Most scikit-learn estimators expect each sample to be represented as a numeric vector, i.e. a set of columns from a DataFrame. Vectorizers are the exception, where a sample is represented by a string of text (or a filename).
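
The distinction above maps directly onto how columns are selected: a scalar key yields 1D data (one string per sample, which suits a vectorizer), while a list key yields 2D data (one row of features per sample, which most other transformers expect). The following is a minimal sketch of that idea, assuming pandas and scikit-learn are installed. The names `text_1d`, `numeric_2d` and `result` are illustrative only, and `min_df` is left at its default rather than the value used in the report.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"],
                        "c": [1, 2]})

# A scalar key selects a 1D Series: one string per sample,
# which is the kind of input a text vectorizer works on.
text_1d = dataset["a"]

# A list key selects a 2D DataFrame: one row of numeric features
# per sample, which most other transformers expect.
numeric_2d = dataset[["c"]]

print(text_1d.shape, numeric_2d.shape)  # (2,) (2, 1)

# The same rule applied to the column selectors of ColumnTransformer.
clmn = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "a"),        # scalar -> 1D input
    ("norm", Normalizer(norm="l1"), ["c"]),   # list   -> 2D input
])
result = clmn.fit_transform(dataset)
print(result.shape)  # (2, 6)
```

Read this way, the two notations are not so much inconsistent as a way of telling `ColumnTransformer` which dimensionality each transformer should receive.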
I’m not convinced that you need to build such a sophisticated example; just try to write some prose to guide the users.