Inconsistent column references with ColumnTransformer for text/numeric columns

Description

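A small two-column dataset with one text column and one numeric column requires inconsistent column-selection notation: the text column must be passed as a scalar, while the numeric column must be passed as a one-item list.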

I found this raised previously, but the errors were still quite confusing even after I had solved the issue.

Steps/Code to Reproduce


```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"],
                        "c": [1, 2]})

clmn = ColumnTransformer([
    ("tfidf", TfidfVectorizer(min_df=0), "a"),
    ("norm", Normalizer(norm='l1'), "c")  # errors
    # ("norm", Normalizer(norm='l1'), ["c"])  # code executes as expected
])
clmn.fit_transform(dataset)
```

Expected Results

```python
array([[0.44943642, 0.6316672 , 0.        , 0.        , 0.6316672 ,
        1.        ],
       [0.44943642, 0.        , 0.6316672 , 0.6316672 , 0.        ,
        1.        ]])
```

Actual Results

ValueError: 1D data passed to a transformer that expects 2D data. Try to specify the column selection as a list of one item instead of a scalar.
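The suggestion in the error message can be tried out directly. The following is a minimal sketch, not taken from the issue: the helper name `build` is hypothetical, and `TfidfVectorizer()` is used with its default `min_df` because recent scikit-learn releases may reject `min_df=0`. It only checks that the scalar selector fails with a `ValueError` and that the one-item list succeeds; the exact error text may differ between versions.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"],
                        "c": [1, 2]})


def build(numeric_selector):
    """Hypothetical helper: same transformer, varying only the numeric selector."""
    return ColumnTransformer([
        # The vectorizer wants a 1D sequence of strings, so "a" stays a scalar.
        ("tfidf", TfidfVectorizer(), "a"),
        # Normalizer wants 2D input, so the selector decides success or failure.
        ("norm", Normalizer(norm="l1"), numeric_selector),
    ])


# Scalar selector: the numeric column reaches Normalizer as 1D data.
try:
    build("c").fit_transform(dataset)
    scalar_failed = False
except ValueError:
    scalar_failed = True

# One-item list: the numeric column reaches Normalizer as a 2D, one-column frame.
result = build(["c"]).fit_transform(dataset)
if sparse.issparse(result):
    # ColumnTransformer may return a sparse matrix depending on density.
    result = result.toarray()

print("scalar selector failed:", scalar_failed)
print("result shape with list selector:", result.shape)
```

With a single numeric column, L1 normalization divides each positive value by itself, which is why the last column of the expected output is all ones.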

Versions

System:
    python: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
    executable: C:\Users\evanmiller\AppData\Local\Continuum\anaconda3\envs\capco\python.exe
    machine: Windows-10-10.0.16299-SP0

BLAS:
    macros:
    lib_dirs:
    cblas_libs: cblas

Python deps:
    pip: 19.1.1
    setuptools: 41.0.1
    sklearn: 0.20.3
    numpy: 1.16.4
    scipy: 1.3.0
    Cython: None
    pandas: 0.24.2

Is there an explanation for why this happens? The single text column must be passed without square brackets ([ ]) around it, while the numeric column requires them.

If I understood more about why it’s happening I’d be happy to write a more informative error message if you think that’s the right call.

Evan

Top GitHub Comments

jnothman commented, Jun 10, 2019

Most scikit-learn estimators expect each sample to be represented as a numeric vector, i.e. a set of columns from a DataFrame. Vectorizers are the exception, where a sample is represented by a string of text (or a filename).
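The distinction can be seen with plain pandas indexing. This is an illustrative sketch under the assumption that the column selection in ColumnTransformer behaves like the equivalent DataFrame indexing (a scalar key yields 1D data, a list key yields 2D data); the variable names are made up for the example.

```python
import pandas as pd

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"],
                        "c": [1, 2]})

# Scalar key: a 1D Series, one entry per sample.
as_scalar = dataset["c"]
# One-item list key: a 2D DataFrame with a single column.
as_list = dataset[["c"]]

print(type(as_scalar).__name__, as_scalar.shape)
print(type(as_list).__name__, as_list.shape)

# A vectorizer iterates over its input to get documents.
# Iterating the 1D text column yields the strings themselves...
docs = list(dataset["a"])
# ...while iterating a DataFrame yields its column labels, not the rows.
cols = list(dataset[["a"]])

print(docs)
print(cols)
```

Under that assumption, a text vectorizer needs the scalar form so that it iterates over documents, while estimators that treat each row as a numeric vector need the list form so that they receive a 2D table.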

jnothman commented, Jul 9, 2019

I’m not convinced that you need to build such a sophisticated example, just try to write some prose to guide the users.
