Inconsistent column references with ColumnTransformer for text/numeric columns

Description

A small two-column dataset with one text column and one numeric column requires inconsistent list notation when specifying the columns for ColumnTransformer.

I found it raised here but I found the errors quite confusing even once I solved the issue.

Steps/Code to Reproduce


```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"],
                        "c": [1, 2]})

clmn = ColumnTransformer([
    ("tfidf", TfidfVectorizer(min_df=0), "a"),
    ("norm", Normalizer(norm='l1'), "c")  # errors
    # ("norm", Normalizer(norm='l1'), ["c"])  # code executes as expected
])
clmn.fit_transform(dataset)
```

Expected Results

```python
array([[0.44943642, 0.6316672 , 0.        , 0.        , 0.6316672 ,
        1.        ],
       [0.44943642, 0.        , 0.6316672 , 0.6316672 , 0.        ,
        1.        ]])
```

Actual Results

ValueError: 1D data passed to a transformer that expects 2D data. Try to specify the column selection as a list of one item instead of a scalar.
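
For reference, the variant the error message points to can be sketched as follows. This is a minimal sketch, not the reporter's exact code: it assumes a recent scikit-learn release and therefore leaves `min_df` at its default, since an integer `min_df=0` may be rejected by newer versions. The name `working` is only illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"],
                        "c": [1, 2]})

# Text column as a scalar (the vectorizer receives 1D data),
# numeric column as a one-item list (the normalizer receives 2D data).
# min_df is left at its default here; that is an assumption of this sketch.
working = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "a"),
    ("norm", Normalizer(norm="l1"), ["c"]),
])
result = working.fit_transform(dataset)
print(result.shape)
```

The result has one row per sample, with the five TF-IDF features followed by the single normalized numeric column.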

Versions

System:
  python: 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
  executable: C:\Users\evanmiller\AppData\Local\Continuum\anaconda3\envs\capco\python.exe
  machine: Windows-10-10.0.16299-SP0

BLAS:
  macros:
  lib_dirs:
  cblas_libs: cblas

Python deps:
  pip: 19.1.1
  setuptools: 41.0.1
  sklearn: 0.20.3
  numpy: 1.16.4
  scipy: 1.3.0
  Cython: None
  pandas: 0.24.2

Is there an explanation for why this happens? The single text column must be given without square brackets ("a"), while the numeric column requires them (["c"]).

If I understood more about why it’s happening I’d be happy to write a more informative error message if you think that’s the right call.

Evan

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
jnothman commented, Jun 10, 2019

Most scikit-learn estimators expect each sample to be represented as a numeric vector, i.e. a set of columns from a DataFrame. Vectorizers are the exception, where a sample is represented by a string of text (or a filename).
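
The difference can be seen outside ColumnTransformer as well. The sketch below assumes that a scalar column key behaves roughly like `df["c"]` (a 1D Series) and a one-item list like `df[["c"]]` (a 2D DataFrame); the variable names are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

df = pd.DataFrame({"a": ["word gone wild", "gone with wind"],
                   "c": [1, 2]})

text_1d = df["a"]       # Series: one string per sample
numeric_1d = df["c"]    # Series: 1D, no feature axis
numeric_2d = df[["c"]]  # DataFrame: 2D, one feature column

# A vectorizer consumes a 1D iterable of documents.
tfidf = TfidfVectorizer().fit_transform(text_1d)

# A numeric transformer consumes a 2D (samples x features) container.
normed = Normalizer(norm="l1").fit_transform(numeric_2d)

# Passing the 1D Series to the numeric transformer is rejected.
try:
    Normalizer(norm="l1").fit_transform(numeric_1d)
    raised = False
except ValueError:
    raised = True

print(text_1d.shape, numeric_2d.shape, tfidf.shape, raised)
```

So the bracket notation is less about text versus numbers and more about whether the receiving transformer wants a sequence of samples or a table of features.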

0 reactions
jnothman commented, Jul 9, 2019

I’m not convinced that you need to build such a sophisticated example, just try to write some prose to guide the users.
