
OneHotEncoder generates a `2` and inconsistent data.


Description

When a OneHotEncoder is created with explicit categories and fitted on a numpy.array with dtype=object, calling transform on an array with a different dtype produces incoherent output.

Steps/Code to Reproduce

Example:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = np.array([['mse'], ['mae']], dtype='object')

encoder = OneHotEncoder(categories=[['mse','mae']], sparse=False)
encoder.fit(categories)

values = np.array([['mae'], ['mae'], ['mse'], ['mae'], ['mse'], ['mse']])
encoder.transform(values)

which outputs:

array([[1., 0.],
       [1., 0.],
       [0., 0.],
       [2., 0.],
       [0., 0.],
       [1., 0.]])

However, if we transform values.astype(object) instead, we get the expected output:

encoder.transform(values.astype(object))
array([[0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.]])
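
The likely culprit, as the workaround above suggests, is a dtype mismatch: values was built without dtype='object', so NumPy infers a fixed-width Unicode dtype, while the encoder was fitted on object data. A minimal diagnostic sketch (the transform_as_object helper is illustrative and not part of the original report):

print(values.dtype)      # dtype('<U3') -- fixed-width Unicode strings
print(categories.dtype)  # dtype('O')   -- object, the dtype seen at fit time

# Hypothetical workaround: cast string inputs to object before transform,
# so transform sees the same dtype that fit saw.
def transform_as_object(encoder, X):
    return encoder.transform(np.asarray(X, dtype=object))

transform_as_object(encoder, values)  # matches the expected output above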

Versions

scikit-learn==0.21.3
scipy==1.3.3
numpy==1.17.4

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
absognety commented, Dec 4, 2019

Switching the categories parameter to 'auto' gives the expected output; the handling of encoders changed in version 0.22.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = np.array([['mse'], ['mae']], dtype='object')

encoder = OneHotEncoder(categories='auto',sparse=False)
encoder.fit(categories)

values = np.array([['mae'], ['mae'], ['mse'], ['mae'], ['mse'], ['mse']])
transformed = encoder.transform(values)
print(transformed)

which prints:

[[1. 0.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]

Versions:

scikit-learn==0.21.3
scipy==1.1.0
numpy==1.15.1
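
For clarity (this note is not part of the original comment): with categories='auto', the fitted categories are sorted, so 'mae' maps to the first column here, which is why the columns are flipped relative to the explicit categories=[['mse','mae']] example above. The learned order can be checked on the fitted encoder:

print(encoder.categories_)  # [array(['mae', 'mse'], dtype=object)]
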
0 reactions
jnothman commented, Dec 4, 2019

Does #15763 fix this?
