OneHotEncoder generates a `2` and inconsistent data.
Description
When you create a `OneHotEncoder` with explicit categories and fit it with a `numpy.array` of `dtype=object`, calling `transform` on an array with a different dtype produces incoherent output.
Steps/Code to Reproduce
Example:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
categories = np.array([['mse'], ['mae']], dtype='object')
encoder = OneHotEncoder(categories=[['mse','mae']], sparse=False)
encoder.fit(categories)
values = np.array([['mae'], ['mae'], ['mse'], ['mae'], ['mse'], ['mse']])
encoder.transform(values)
which outputs:
array([[1., 0.],
       [1., 0.],
       [0., 0.],
       [2., 0.],
       [0., 0.],
       [1., 0.]])
However, if we cast the input with `values.astype(object)`, we get the expected output:
encoder.transform(values.astype(object))
array([[0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]])
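The cast above can be wrapped in a small helper so that every call to `transform` sees the same dtype the encoder was fitted with. This is only a sketch of the workaround: `transform_as_object` is a hypothetical name, not part of scikit-learn, and the `sparse` argument is left out on purpose (the result is densified with `toarray()` instead) so the snippet does not depend on how a given release spells that parameter.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def transform_as_object(encoder, values):
    """Hypothetical helper: cast the input to object dtype, then transform."""
    values = np.asarray(values).astype(object)
    encoded = encoder.transform(values)
    # Densify only when a sparse matrix comes back.
    return encoded.toarray() if hasattr(encoded, "toarray") else encoded


# Same setup as the report: explicit, unsorted categories and an object-dtype fit.
encoder = OneHotEncoder(categories=[["mse", "mae"]])
encoder.fit(np.array([["mse"], ["mae"]], dtype=object))

# Unicode-dtype input, as in the failing call.
values = np.array([["mae"], ["mae"], ["mse"], ["mae"], ["mse"], ["mse"]])
result = transform_as_object(encoder, values)
print(result)
```

With the explicit category order `['mse', 'mae']`, `'mae'` should land in the second column, which matches the expected output shown above.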
Versions
scikit-learn==0.21.3 scipy==1.3.3 numpy==1.17.4
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (2 by maintainers)
Switching the `categories` input to `'auto'` gives the expected output. The handling of encoders changed in version 0.22.
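A minimal sketch of the `categories='auto'` variant described in this comment, using the same data as the report. One caveat: with `'auto'` the encoder infers the categories from the fitted data in sorted order, so the columns come out as `['mae', 'mse']` rather than the `['mse', 'mae']` order passed explicitly in the report. As before, the `sparse` argument is omitted and the result is densified with `toarray()`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Let the encoder infer the categories instead of passing them explicitly.
encoder = OneHotEncoder(categories="auto")
encoder.fit(np.array([["mse"], ["mae"]], dtype=object))

# Unicode-dtype input, as in the failing call from the report.
values = np.array([["mae"], ["mae"], ["mse"], ["mae"], ["mse"], ["mse"]])
encoded = encoder.transform(values)
result = encoded.toarray() if hasattr(encoded, "toarray") else encoded

print(encoder.categories_[0])  # inferred categories, in sorted order
print(result)
```

Each row should contain exactly one `1.`, with `'mae'` in the first column and `'mse'` in the second.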
Does #15763 fix this?