TfidfVectorizer handles multiple text columns
It’s really nice that transformers such as sklearn.preprocessing.OneHotEncoder and sklearn.preprocessing.StandardScaler can operate on multiple data columns simultaneously. sklearn.feature_extraction.text.TfidfVectorizer, on the other hand, can only process one column at a time, so you need to make a new transformer for each text column in your dataset. This can get a little tedious and in particular makes pipelines more verbose.
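For illustration, here is a minimal sketch of the one-vectorizer-per-column setup described above. The DataFrame and its column names (title, body) are made up for the example; the point is only that every text column needs its own ColumnTransformer entry.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data with two text columns.
df = pd.DataFrame({
    "title": ["red apple", "green pear", "red pear"],
    "body": ["sweet and crisp", "soft and juicy", "crisp and juicy"],
})

# One entry per text column. Each column is given as a plain string
# (not a list), so the vectorizer receives a 1-D sequence of documents.
preprocess = ColumnTransformer([
    ("title_tfidf", TfidfVectorizer(), "title"),
    ("body_tfidf", TfidfVectorizer(), "body"),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 3 rows; columns = title vocabulary + body vocabulary
```

With more text columns the list of entries grows accordingly, which is the verbosity the issue is about.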
It’d be nice if TfidfVectorizer could also operate on multiple text columns, using the same settings for each column, perhaps with an option to either make one vocabulary per column or use a shared vocabulary across all the columns.
It might be easiest to implement this as a new class that wraps TfidfVectorizer; sagemaker-scikit-learn-extension takes this approach.
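A rough sketch of what such a wrapper could look like is below. The class name MultiColumnTfidf and its vectorizer parameter are hypothetical, and this is not the SageMaker implementation or a proposed scikit-learn API; it only illustrates the one-vocabulary-per-column variant by cloning a template vectorizer for each column.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer


class MultiColumnTfidf(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: fits one TfidfVectorizer clone per column."""

    def __init__(self, vectorizer=None):
        # Template vectorizer whose settings are reused for every column.
        self.vectorizer = vectorizer

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=object)  # accepts a DataFrame or 2-D array
        base = self.vectorizer if self.vectorizer is not None else TfidfVectorizer()
        # Separate vocabulary per column: clone the template and fit it
        # on that column's documents only.
        self.vectorizers_ = [clone(base).fit(X[:, j]) for j in range(X.shape[1])]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=object)
        blocks = [v.transform(X[:, j]) for j, v in enumerate(self.vectorizers_)]
        # Concatenate the per-column TF-IDF matrices side by side.
        return sparse.hstack(blocks, format="csr")


# Usage on made-up data: two text columns, three samples.
X = [
    ["red apple", "sweet and crisp"],
    ["green pear", "soft and juicy"],
    ["red pear", "crisp and juicy"],
]
Xt = MultiColumnTfidf().fit_transform(X)
print(Xt.shape)
```

A shared-vocabulary variant would presumably fit a single vectorizer on the documents of all columns together and then transform each column with it; that option is left out of this sketch.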
If this seems like a good idea, I’d be happy to make a PR.
Issue Analytics
- Created 4 years ago
- Reactions: 5
- Comments: 20 (20 by maintainers)
@jnothman @amueller Here’s an example pipeline I was debugging today. It follows a common failure pattern I’ve seen a few times now:
[Pipeline code not preserved; surviving inline comments: "handles multiple columns", "oops"]
I fixed this bug by replacing TfidfVectorizer with MultiColumnTfidfVectorizer from sagemaker_sklearn_extension.feature_extraction.text. The fact that AWS added this to their sagemaker_sklearn_extension extensions indicates that their users frequently run into this problem too.
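The kind of mistake described in this comment can be sketched as follows. This is an illustration on made-up data, not the commenter's original pipeline: it assumes the bug comes from handing TfidfVectorizer a two-column DataFrame, which is what a ColumnTransformer does when the columns are given as a list.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data with two text columns.
df = pd.DataFrame({
    "title": ["red apple", "green pear", "red pear"],
    "body": ["sweet and crisp", "soft and juicy", "crisp and juicy"],
})

# Iterating over a DataFrame yields its column labels, so the vectorizer
# treats the strings "title" and "body" as the only two documents.
X_wrong = TfidfVectorizer().fit_transform(df[["title", "body"]])
print(X_wrong.shape)  # one row per column, not one row per sample

# A single column (a Series) iterates over its values, as intended.
X_ok = TfidfVectorizer().fit_transform(df["title"])
print(X_ok.shape)  # one row per sample
```

No exception is raised in the first call, which is likely why the problem is easy to miss until a later step complains about mismatched row counts.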
I’ve used SageMaker’s TfidfVectorizer and I like it. It’s nice and simple. It doesn’t support all the parameters of sklearn’s TfidfVectorizer, though, which is annoying.