question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

TfidfVectorizer handles multiple text columns

See original GitHub issue

It’s really nice that transformers such as sklearn.preprocessing.OneHotEncoder and sklearn.preprocessing.StandardScaler can operate on multiple data columns simultaneously.

sklearn.feature_extraction.text.TfidfVectorizer on the other hand, can only process one column at a time, so you need to make a new transformer for each text column in your dataset. This can get a little tedious and in particular makes pipelines more verbose.

It’d be nice if TfidfVectorizer could also operate on multiple text columns, using the same settings for each column, perhaps with an option to make one vocabulary per column, or use a shared vocabulary across all the columns.

It might be easiest to implement this as a new class that wraps TfidfVectorizer sagemaker-scikit-learn-extension takes this approach.

If this seems like a good idea, I’d be happy to make a PR.

Issue Analytics

  • State:open
  • Created 4 years ago
  • Reactions:5
  • Comments:20 (20 by maintainers)

github_iconTop GitHub Comments

4reactions
zachmayercommented, Jun 15, 2020

@jnothman @amueller Here’s an example pipeline I was debugging today. It follows a common failure pattern I’ve seen a few times now:

  1. Categorical preprocessing handles multiple columns
  2. Numeric preprocessing handles multiple columns
  3. Text preprocessing handles multiple columns oops
# Categorical pipeline
categorical_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(strategy='constant', fill_value='?')),
    ('One Hot Encoding', OneHotEncoder(handle_unknown='ignore')),
]
)
# Numeric pipeline
numeric_preprocessing = Pipeline(
[
     ('Imputation', SimpleImputer(strategy='mean')),
     ('Scaling', StandardScaler())
]
)
text_preprocessing = Pipeline(
[
     ('Text',TfidfVectorizer())       
]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
     (numeric_preprocessing, numeric_features),
     (categorical_preprocessing, categorical_features),
     (text_preprocessing, text_features),
)
# Final pipeline
pipeline = Pipeline(
[('Preprocessing', preprocessing)]
)
test = pipeline.fit_transform(x_train)

I fixed this bug by replacing TfidfVectorizer with sagemaker_sklearn_extension.feature_extraction.text import MultiColumnTfidfVectorizer.

The fact that AWS added this to their sagemaker_sklearn_extension extensions indicate their users frequently run into this problem too.

0reactions
zachmayercommented, Feb 19, 2022

I’ve used SageMaker’s TFIDFVectorizer and I like it. It’s nice and simple. Doesn’t support all the parameters of TFIDFVectorizer in sklearn though which is annoying

Read more comments on GitHub >

github_iconTop Results From Across the Web

Vectorize two text columns in a ColumnTransformer - YouTube
Want to vectorize two text columns in a ColumnTransformer?You can't pass them in a list, but you can pass the vectorizer twice!
Read more >
Computing separate tfidf scores for two different columns ...
Here's how it would work for your data: import pandas as pd from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.preprocessing import ...
Read more >
sklearn.feature_extraction.text.TfidfVectorizer
The callable handles preprocessing, tokenization, and n-grams generation. Returns: analyzer: callable. A function to handle preprocessing, tokenization and n- ...
Read more >
Convert Text Documents to a TF-IDF Matrix with tfidfvectorizer
Convert text documents to vectors using TF-IDF vectorizer for topic ... and contains two “the”, the TF ratio of this word would be...
Read more >
TF-IDF Vectorizer scikit-learn - Medium
CountVectorizer from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer import pandas as pd# set of documentstrain ...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found