A strange bug in sklearn.preprocessing.OneHotEncoder() when transforming an unknown string with handle_unknown='ignore'
Describe the bug
Using sklearn.preprocessing.OneHotEncoder() with handle_unknown='ignore' produces wrong output when all of the following hold: categories_ is set manually, all values are strings, and an unknown category is encountered during transform.
Steps/Code to Reproduce
import numpy as np
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
onehot = OneHotEncoder(handle_unknown='ignore')
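# Assign categories_ by hand instead of calling fit
onehot.categories_ = [np.array(['bread', 'milk', 'diaper', 'beer']) for _ in range(3)]
# 'none' is not among the categories, so it should be encoded as all zeros
output = onehot.transform([['bread', 'milk', 'none'], ['bread', 'diaper', 'beer']]).toarray()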
print(output)
Expected Results
[[1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.]]
Actual Results
[[1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0.]]
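One possible reason the two results differ only in where 'beer' lands is offered here as an assumption, since nothing in this issue confirms it. A lookup based on binary search, such as np.searchsorted, expects sorted categories, and the manually assigned array is not sorted. The sketch below demonstrates only the NumPy behaviour. Whether OneHotEncoder takes such a code path depends on the installed scikit-learn version.

```python
import numpy as np

# Same unsorted category array as in the report.
cats = np.array(['bread', 'milk', 'diaper', 'beer'])

# Binary search assumes sorted input; on this array it reports position 0.
idx_unsorted = int(np.searchsorted(cats, 'beer'))

# A plain equality lookup finds the position the report expects.
idx_true = int(np.where(cats == 'beer')[0][0])

print(idx_unsorted, idx_true)
```

Position 0 corresponds to the 'bread' slot of the third feature, which is where the stray 1 appears in the actual output.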
Versions
System:
    python: 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
    executable: D:\ProgramData\Anaconda3\python.exe
    machine: Windows-10-10.0.19041-SP0
Python dependencies:
    pip: 21.0.1
    setuptools: 45.2.0.post20200210
    sklearn: 0.22.1
    numpy: 1.16.0
    scipy: 1.4.1
    Cython: 0.29.15
    pandas: 0.25.0
    matplotlib: 3.1.0
    joblib: 0.14.1
Built with OpenMP: True
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (5 by maintainers)
Top GitHub Comments
sklearn's OneHotEncoder API currently requires fit to be called. In your case:

In principle, we could design OneHotEncoder to be more stateless and not require fit.

So there is no bug then. I am closing the issue. We could convert it to a discussion, I think?
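The maintainer's point that fit must be called can be sketched as follows. This is not the snippet from the original comment; it is a minimal illustration that passes the desired categories through the constructor's categories parameter and then calls fit. The training rows in X_train are hypothetical and exist only so that fit has data with three features.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Desired category order for each of the three features (taken from the report).
categories = [['bread', 'milk', 'diaper', 'beer'] for _ in range(3)]

# Hypothetical training rows containing only known categories.
X_train = [['bread', 'milk', 'diaper'], ['beer', 'beer', 'beer']]

# Pass the categories to the constructor and fit, rather than assigning categories_.
onehot = OneHotEncoder(categories=categories, handle_unknown='ignore')
onehot.fit(X_train)

# 'none' is unknown and is encoded as all zeros for that feature.
X_new = [['bread', 'milk', 'none'], ['bread', 'diaper', 'beer']]
output = onehot.transform(X_new).toarray()
print(output)
```

With the categories supplied this way, the result should match the matrix listed under Expected Results.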