LabelEncoder throws an error when it's used in a Pipeline or in a ColumnTransformer
Description
The fit and fit_transform methods in LabelEncoder don't follow the standard scikit-learn convention for these methods: fit(X[, y]) and fit_transform(X[, y]). In LabelEncoder, both methods accept only one argument: fit(y) and fit_transform(y).
Therefore, LabelEncoder cannot be used inside a Pipeline or a ColumnTransformer. I suspect that there are a number of other classes with which it doesn't work (GridSearchCV, …), but I haven't tested them.
In contrast, the fit and fit_transform methods in OneHotEncoder and OrdinalEncoder follow the standard scikit-learn signature.
See reference:
LabelEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
OneHotEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
OrdinalEncoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder
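The signature difference described above can be checked directly. The following is a minimal sketch using the standard-library inspect module; it only reads the parameter names of each fit method and assumes a scikit-learn installation is available.

```python
import inspect

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Collect the parameter names of each encoder's fit method.
fit_params = {
    cls.__name__: list(inspect.signature(cls.fit).parameters)
    for cls in (LabelEncoder, OneHotEncoder, OrdinalEncoder)
}

for name, params in fit_params.items():
    print(name, params)
```

LabelEncoder.fit lists only self and y, while the two feature encoders take X with an optional y.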
Steps/Code to Reproduce
Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
import sklearn.tree as tree

X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})

# Case 1: LabelEncoder inside a ColumnTransformer (raises TypeError)
column_trans = ColumnTransformer(
    [('title_bow', LabelEncoder(), 'title')],
    remainder='drop').fit(X)

# Case 2: LabelEncoder inside a Pipeline (raises the same TypeError)
pipe = make_pipeline(LabelEncoder(), tree.DecisionTreeClassifier()).fit(X)
Expected Results
No error is thrown.
Actual Results
The same error in both cases: TypeError: fit_transform() takes 2 positional arguments but 3 were given.
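The argument count in this message can be illustrated with a minimal sketch. Pipeline and ColumnTransformer pass both X and y to each transformer's fit_transform, which together with self makes three positional arguments, one more than LabelEncoder accepts. The direct call below imitates that calling convention; it is an illustration, not the exact internal code path.

```python
from sklearn.preprocessing import LabelEncoder

titles = ["His Last Bow", "A Moveable Feast"]

# Imitate the fit_transform(X, y) call that Pipeline and
# ColumnTransformer make on every transformer step.
try:
    LabelEncoder().fit_transform(titles, None)
    raised = False
except TypeError as exc:
    raised = True
    print(exc)

# The single-argument form, intended for target labels, works.
codes = LabelEncoder().fit_transform(titles)
```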
Versions
System:
    python: 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16) [GCC 7.3.0]
    executable: /home/twins/anaconda3/envs/pytorch/bin/python
    machine: Linux-4.8.0-56-generic-x86_64-with-debian-stretch-sid
BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
    lib_dirs: /home/twins/anaconda3/envs/pytorch/lib
    cblas_libs: mkl_rt, pthread
Python deps:
    pip: 18.1
    setuptools: 40.6.2
    sklearn: 0.20.1
    numpy: 1.15.4
    scipy: 1.1.0
    Cython: None
    pandas: 0.23.4
Thanks for the amazing job you do!
Issue Analytics
- State:
- Created 5 years ago
- Reactions:2
- Comments:11 (4 by maintainers)
Top GitHub Comments
Yes indeed, from the user guide:
To encode features, you need to use OneHotEncoder or OrdinalEncoder.
What if I want to label-encode the input features? With LabelEncoder there is also a problem with unseen categories (e.g. in the test set). What should I do then?
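One possible answer, sketched below, is to use OrdinalEncoder for the feature column, since it produces integer codes like LabelEncoder but follows the fit(X[, y]) signature. The handle_unknown='use_encoded_value' and unknown_value options used here assume scikit-learn 0.24 or later, so they are not available in the 0.20.1 version reported in this issue. The test frame and the choice of -1 for unseen categories are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'expert_rating': [5, 3, 4, 5]})

# Hypothetical test data containing a city not seen during fit.
X_test = pd.DataFrame(
    {'city': ['Paris', 'Berlin'],
     'expert_rating': [4, 2]})

# Assumes scikit-learn >= 0.24: unseen categories are mapped to -1
# instead of raising an error at transform time.
encoder = OrdinalEncoder(handle_unknown='use_encoded_value',
                         unknown_value=-1)

column_trans = ColumnTransformer(
    [('city_codes', encoder, ['city'])],
    remainder='passthrough')

train_encoded = column_trans.fit_transform(X_train)
test_encoded = column_trans.transform(X_test)
print(test_encoded)
```

Here 'Paris' keeps the code it was assigned during fit, and the unseen 'Berlin' is encoded as -1. LabelEncoder itself remains suited to encoding the target y outside the pipeline.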