Better error message when passing un-sortable data to the Encoders
Description
scikit-learn’s handling of errors where an unexpected / unusable value appears in input leaves something to be desired. Errors are cryptic and confusing.
Steps/Code to Reproduce
Example:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Simulate missing value in data
feature_with_missing_value = pd.Series([1, 2, 3, '?', 42, 69])
LabelEncoder().fit_transform(feature_with_missing_value.values)
Something similar also happened when I absentmindedly concatenated two CSV files without ensuring only one header appeared at the top of the file, then attempted to use LabelEncoder with the second header sandwiched in the middle.
I believe I have also encountered this issue with pipelines and missing values, but that was a while ago. I eventually figured out what was happening, so unfortunately I can’t replicate that error.
Expected Results
A reasonable error message that addresses the issue in a clear and direct manner. Here is an example of what that would look like:
https://www.kaggle.com/c/titanic/discussion/26976
Error in predict.randomForest(rf, extractFeatures(test)) : missing values in newdata
Because scikit-learn’s algorithms currently only accept numerical input (AFAIK), any non-numerical data should be treated as missing values or otherwise seen as aberrant.
Actual Results
Traceback (most recent call last):
  File "poop.py", line 5, in <module>
    LabelEncoder().fit_transform(feature_with_missing_value)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/preprocessing/label.py", line 112, in fit_transform
    self.classes_, y = np.unique(y, return_inverse=True)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 223, in unique
    return _unique1d(ar, return_index, return_inverse, return_counts)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 280, in _unique1d
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: unorderable types: str() < int()
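The traceback shows the failure comes from sorting (argsort inside np.unique), not from anything specific to missing values. A minimal pure-Python sketch of the same root cause, using the values from the example above: Python 3 refuses to order str against int. The names `values`, `sort_failed`, and `as_strings` are only illustrative, and casting to str is just one possible workaround, assuming string categories are acceptable for the feature.

```python
# Same values as in the report: integers with one stray '?' string.
values = [1, 2, 3, "?", 42, 69]

# Sorting mixed str/int raises TypeError in Python 3, which is what
# surfaces from the argsort call in the traceback.
try:
    sorted(values)
    sort_failed = False
except TypeError as exc:
    sort_failed = True
    print("sorting failed:", exc)

# One possible workaround (an assumption, not a recommendation from the
# thread): make the types uniform before encoding.
as_strings = sorted(str(v) for v in values)
print(as_strings)
```

Note that after the cast the ordering is lexicographic, so '42' sorts before '69' but after '3'.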
Versions
Home:
Linux-4.4.0-137-generic-i686-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.2
SciPy 1.0.1
Scikit-Learn 0.19.1
Google Cloud:
System
------
machine: Linux-4.14.33+-x86_64-with-debian-9.5
python: 3.5.3 (default, Sep 27 2018, 17:25:39) [GCC 6.3.0 20170516]
executable: /usr/bin/python3
BLAS
----
lib_dirs:
macros:
cblas_libs: cblas
Python deps
-----------
setuptools: 40.6.2
scipy: 1.1.0
pip: 9.0.1
Cython: None
numpy: 1.15.4
pandas: 0.23.4
sklearn: 0.20.0
Issue Analytics
- Created 5 years ago
- Comments:27 (19 by maintainers)
Top GitHub Comments
I’ve just tried to reproduce this one, and it now returns an appropriate error message:
TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']
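For readers on older versions, a check in the spirit of that message can be sketched in user code. This is only an illustration and not scikit-learn's actual implementation; `check_uniform_types` is a hypothetical helper name, and the message wording is shortened on purpose.

```python
import numbers


def check_uniform_types(values):
    """Hypothetical helper: raise if values mix strings and numbers.

    Returns the sorted list of type names seen in the input.
    """
    type_names = sorted({type(v).__name__ for v in values})
    all_strings = all(isinstance(v, str) for v in values)
    all_numbers = all(isinstance(v, numbers.Number) for v in values)
    if not (all_strings or all_numbers):
        raise TypeError(
            "Input must be uniformly strings or numbers. Got {}".format(type_names)
        )
    return type_names


# Uniform input passes and reports the types it saw.
uniform_types = check_uniform_types([1, 2, 3])

# Mixed input from the report fails with a readable message.
try:
    check_uniform_types([1, 2, 3, "?", 42, 69])
    mixed_message = None
except TypeError as exc:
    mixed_message = str(exc)
    print(mixed_message)
```

Calling such a helper before `fit_transform` would turn the cryptic sorting error into one that names the offending types.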
Thanks for the report. Can confirm the same issue in master. It’s actually unrelated to missing values. We try hard to give good error messages, so your feedback is much appreciated. If you can reproduce the other issue, please let us know. Btw, the chance that you need to use a LabelEncoder is pretty slim (which is not an excuse for bad error messages). If it’s actually a label, you don’t need to encode it. If it’s the data, you should be using OneHotEncoder (in version 0.20).
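A minimal sketch of the OneHotEncoder route on the values from the report, assuming scikit-learn 0.20 or later. The cast to str is an assumption made here so the column has a uniform type; it was not suggested in the thread, and whether '?' should instead be treated as a missing value depends on the data. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Values from the report, held as an object array because of the mixed types.
raw = np.array([1, 2, 3, "?", 42, 69], dtype=object)

# Assumption: treating every value as a string category is acceptable.
# OneHotEncoder expects 2D input of shape (n_samples, n_features).
column = raw.astype(str).reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify for inspection.
encoded = encoder.fit_transform(column).toarray()

print(encoder.categories_[0])
print(encoded.shape)
```

Each row of `encoded` has exactly one non-zero entry, one column per distinct category.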