A strange bug in sklearn.preprocessing.OneHotEncoder() when transforming an unknown string with handle_unknown='ignore'

Describe the bug

Using sklearn.preprocessing.OneHotEncoder() with handle_unknown='ignore' produces wrong output when all of the following hold: categories_ is set manually, all values are strings, and an unknown category is encountered during transform.

Steps/Code to Reproduce

import numpy as np
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

onehot = OneHotEncoder(handle_unknown='ignore')
# categories_ is assigned by hand here; fit() is never called
onehot.categories_ = [np.array(['bread', 'milk', 'diaper', 'beer']) for _ in range(3)]
output = onehot.transform([['bread', 'milk', 'none'], ['bread', 'diaper', 'beer']]).toarray()
print(output)

Expected Results

[[1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]]

Actual Results

[[1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0.]]
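One plausible reading of the mismatch, which is not confirmed anywhere in this thread: if the lookup for the third column is a binary search that presumes sorted categories, the hand-set order ['bread', 'milk', 'diaper', 'beer'] would send 'beer' to index 0 instead of index 3. The sketch below uses only the standard library's bisect_left as a stand-in for such a search. It does not call scikit-learn, and treating this as the encoder's actual code path is an assumption.

```python
from bisect import bisect_left

# Categories exactly as assigned to categories_ in the report (not sorted).
categories = ['bread', 'milk', 'diaper', 'beer']

# Assumption: a binary search that expects sorted input, standing in for
# whatever lookup the encoder performs internally.
idx = bisect_left(categories, 'beer')

# On this unsorted list the search lands on index 0, which holds 'bread'.
print(idx, categories[idx])

# A plain linear lookup finds the true position of 'beer'.
print(categories.index('beer'))
```

Index 0 within the third block of four columns is column 8 of the row (counting from zero), which is where the stray 1 sits in the actual output above; the expected output has it at column 11, the true position of 'beer'.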

Versions

System:
    python: 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
    executable: D:\ProgramData\Anaconda3\python.exe
    machine: Windows-10-10.0.19041-SP0

Python dependencies:
    pip: 21.0.1
    setuptools: 45.2.0.post20200210
    sklearn: 0.22.1
    numpy: 1.16.0
    scipy: 1.4.1
    Cython: 0.29.15
    pandas: 0.25.0
    matplotlib: 3.1.0
    joblib: 0.14.1

Built with OpenMP: True

Issue Analytics

State:
Created 2 years ago
Comments: 6 (5 by maintainers)

Top GitHub Comments

2 reactions

thomasjpfan commented, Apr 7, 2021

sklearn’s OneHotEncoder API currently requires fit to be called. In your case:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Pass the categories to the constructor instead of assigning categories_ by hand
categories = [np.array(['bread', 'milk', 'diaper', 'beer']) for _ in range(3)]
# Note: scikit-learn 1.2 renamed the `sparse` parameter to `sparse_output`
onehot = OneHotEncoder(handle_unknown='ignore', categories=categories,
                       sparse=False)
output = onehot.fit_transform(
    [['bread', 'milk', 'none'],
     ['bread', 'diaper', 'beer']])
print(output)
# [[1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
#  [1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]]

In principle, we could design OneHotEncoder to be more stateless and not require fit.
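To make the idea of an encoder that does not require fit concrete, here is a minimal pure-Python sketch, assuming the categories are always passed in explicitly and that unknown values should encode as all zeros. The function name one_hot_ignore_unknown is hypothetical and is not part of scikit-learn; the sketch only illustrates the behavior the reporter expected.

```python
def one_hot_ignore_unknown(rows, categories):
    """Encode rows against explicit per-column categories.

    Hypothetical helper, not a scikit-learn API. Unknown values yield
    an all-zero block for their column.
    """
    # One dict per column mapping category -> position, in the given order.
    lookups = [{cat: i for i, cat in enumerate(cats)} for cats in categories]
    encoded = []
    for row in rows:
        out = []
        for value, cats, lookup in zip(row, categories, lookups):
            block = [0.0] * len(cats)
            if value in lookup:
                block[lookup[value]] = 1.0
            out.extend(block)
        encoded.append(out)
    return encoded


categories = [['bread', 'milk', 'diaper', 'beer'] for _ in range(3)]
rows = [['bread', 'milk', 'none'], ['bread', 'diaper', 'beer']]
result = one_hot_ignore_unknown(rows, categories)
for encoded_row in result:
    print(encoded_row)
```

Because the lookup is a dictionary keyed by category, the order of the categories does not need to be sorted, and no state survives between calls.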

0 reactions

glemaitre commented, Apr 9, 2021

So there is no bug then. I am closing the issue. We could convert it to a discussion I think?
