
OneHotEncoder generates a `2` and inconsistent data.


Description

When a OneHotEncoder is created with explicit categories and fitted on a numpy.array with dtype=object, calling transform on an array with a different dtype produces incoherent output.

Steps/Code to Reproduce

Example:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = np.array([['mse'], ['mae']], dtype='object')

encoder = OneHotEncoder(categories=[['mse','mae']], sparse=False)
encoder.fit(categories)

values = np.array([['mae'], ['mae'], ['mse'], ['mae'], ['mse'], ['mse']])
encoder.transform(values)

which outputs:

array([[1., 0.],
       [1., 0.],
       [0., 0.],
       [2., 0.],
       [0., 0.],
       [1., 0.]])

However, if we transform values.astype(object) instead, we get the expected output:

encoder.transform(values.astype(object))
array([[0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]])
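
The likely culprit, as the workaround above suggests, is a dtype mismatch: values was built without dtype='object', so NumPy infers a fixed-width Unicode dtype, while the encoder was fitted on object data. A minimal diagnostic sketch (the transform_as_object helper is illustrative and not part of the original report):

print(values.dtype)      # dtype('<U3') -- fixed-width Unicode strings
print(categories.dtype)  # dtype('O')   -- object, the dtype seen at fit time

# Hypothetical workaround: cast string inputs to object before transform,
# so transform sees the same dtype that fit saw.
def transform_as_object(encoder, X):
    return encoder.transform(np.asarray(X, dtype=object))

transform_as_object(encoder, values)  # matches the expected output above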

Versions

scikit-learn==0.21.3
scipy==1.3.3
numpy==1.17.4

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
absognety commented, Dec 4, 2019

Switching the categories parameter to 'auto' gives the expected output; the handling of encoders changed in version 0.22.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = np.array([['mse'], ['mae']], dtype='object')

encoder = OneHotEncoder(categories='auto',sparse=False)
encoder.fit(categories)

values = np.array([['mae'], ['mae'], ['mse'], ['mae'], ['mse'], ['mse']])
transformed = encoder.transform(values)
print(transformed)

which prints:

[[1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]

Versions:

scikit-learn==0.21.3
scipy==1.1.0
numpy==1.15.1
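
For clarity (this note is not part of the original comment): with categories='auto', the fitted categories are sorted, so 'mae' maps to the first column here, which is why the columns are flipped relative to the explicit categories=[['mse','mae']] example above. The learned order can be checked on the fitted encoder:

print(encoder.categories_)  # [array(['mae', 'mse'], dtype=object)]
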
0 reactions
jnothman commented, Dec 4, 2019

Does #15763 fix this?
