Better error message when passing un-sortable data to the Encoders

Description

scikit-learn’s handling of errors where an unexpected / unusable value appears in input leaves something to be desired. Errors are cryptic and confusing.

Steps/Code to Reproduce

Example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Simulate missing value in data
feature_with_missing_value = pd.Series([1, 2, 3, '?', 42, 69])
LabelEncoder().fit_transform(feature_with_missing_value.values)

Something similar also happened when I absentmindedly concatenated two CSV files without ensuring only one header appeared at the top of the file and attempted to use LabelEncoder with the second header sandwiched in the middle.

I believe I have also encountered this issue with pipelines and missing values, but that was a while ago and I eventually figured out what was happening, so unfortunately I can’t replicate that error.
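
Until the error message improves, one way to track down a stray placeholder or a duplicated header row is to scan the column before encoding it. The following is only a minimal sketch: `find_non_numeric` is a hypothetical helper, not part of pandas or scikit-learn, and it assumes the column is supposed to be entirely numeric.

```python
from numbers import Number


def find_non_numeric(values):
    """Return (position, value) pairs for entries that are not numbers.

    Hypothetical helper for illustration; assumes the column should
    contain only numeric values.
    """
    return [(i, v) for i, v in enumerate(values) if not isinstance(v, Number)]


# Same values as the reproduction above, as a plain list.
column = [1, 2, 3, '?', 42, 69]
print(find_non_numeric(column))
```

The reported positions point directly at the rows to inspect, which is the information the original traceback does not give.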

Expected Results

A reasonable error message that addresses the issue in a clear and direct manner. Here is an example of what that would look like:

https://www.kaggle.com/c/titanic/discussion/26976

Error in predict.randomForest(rf, extractFeatures(test)) : missing values in newdata

Because scikit-learn’s algorithms currently only accept numerical input (AFAIK), any non-numerical data should be treated as missing values or otherwise seen as aberrant.

Actual Results

Traceback (most recent call last):
  File "poop.py", line 5, in <module>
    LabelEncoder().fit_transform(feature_with_missing_value)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/preprocessing/label.py", line 112, in fit_transform
    self.classes_, y = np.unique(y, return_inverse=True)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 223, in unique
    return _unique1d(ar, return_index, return_inverse, return_counts)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 280, in _unique1d
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: unorderable types: str() < int()
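
The traceback ends inside a sort, so the failure comes from comparing a `str` with an `int`, not from anything specific to missing values. As a rough illustration, the same kind of failure can be reproduced without NumPy by sorting a mixed list. Note that `try_sort` is a hypothetical helper and the built-in `sorted` is only a stand-in for the sort that `np.unique` performs; the exact wording of the message also depends on the Python version.

```python
def try_sort(values):
    """Return (sorted_values, None) on success or (None, error) on failure.

    Hypothetical helper; built-in sorted() stands in for the sort that
    np.unique performs on an object array.
    """
    try:
        return sorted(values), None
    except TypeError as exc:
        return None, exc


result, error = try_sort([1, 2, 3, '?', 42, 69])
# Python 3 refuses to order str against int, so error holds a TypeError.
print(type(error).__name__, error)
```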

Versions

Home:

Linux-4.4.0-137-generic-i686-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
NumPy 1.14.2
SciPy 1.0.1
Scikit-Learn 0.19.1

Google Cloud:

System
------
   machine: Linux-4.14.33+-x86_64-with-debian-9.5
    python: 3.5.3 (default, Sep 27 2018, 17:25:39)  [GCC 6.3.0 20170516]
executable: /usr/bin/python3
BLAS
----
  lib_dirs:
    macros:
cblas_libs: cblas
Python deps
-----------
setuptools: 40.6.2
     scipy: 1.1.0
       pip: 9.0.1
    Cython: None
     numpy: 1.15.4
    pandas: 0.23.4
   sklearn: 0.20.0

Issue Analytics

Created 5 years ago
Comments: 27 (19 by maintainers)

Top GitHub Comments

bmaisonn commented, Jul 25, 2020

I’ve just tried to reproduce this one, but it now returns an appropriate error message:

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']
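
For readers wondering how a message like this can be produced, here is a minimal sketch of the general idea: attempt the sort, and if the comparison fails, re-raise with the offending type names. This is an assumption about the approach, not scikit-learn's actual implementation, and `unique_sorted` is a hypothetical name.

```python
def unique_sorted(values):
    """Return the sorted unique values, or raise a descriptive TypeError.

    Illustrative sketch only; not scikit-learn's code.
    """
    try:
        return sorted(set(values))
    except TypeError:
        # Collect the distinct type names so the message names the culprit.
        type_names = sorted({type(v).__name__ for v in values})
        raise TypeError(
            "Encoders require their input to be uniformly strings or numbers. "
            f"Got {type_names}"
        ) from None


try:
    unique_sorted([1, 2, 3, '?', 42, 69])
except TypeError as exc:
    message = str(exc)
    print(message)
```

Catching the low-level comparison error and replacing it keeps the fast path unchanged for valid input while giving a message that describes the data problem.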

amueller commented, Nov 20, 2018

Thanks for the report. Can confirm the same issue in master. It’s actually unrelated to missing values. We try hard to give good error messages so your feedback is much appreciated. If you can reproduce the other issue, please let us know. Btw, the chance that you need to use a LabelEncoder is pretty slim (which is not an excuse for bad error messages). If it’s actually a label, you don’t need to encode it. If it’s the data, you should be using OneHotEncoder (in version 0.20).