Better error message when passing un-sortable data to the Encoders

See original GitHub issue

Description

scikit-learn’s handling of errors where an unexpected / unusable value appears in input leaves something to be desired. Errors are cryptic and confusing.

Steps/Code to Reproduce

Example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Simulate missing value in data
feature_with_missing_value = pd.Series([1, 2, 3, '?', 42, 69])
LabelEncoder().fit_transform(feature_with_missing_value.values)

Something similar also happened when I absentmindedly concatenated two CSV files without ensuring only one header appeared at the top of the file and attempted to use LabelEncoder with the second header sandwiched in the middle.

I believe I have also run into this issue with pipelines and missing values. That was a while ago, and I eventually figured out what was happening, so unfortunately I can’t reproduce that error.
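One way to spot such stray values before they reach the encoder is to coerce the column to numbers and inspect whatever fails to parse. The following is a minimal sketch using pandas, not part of the original report; the names `feature`, `as_numbers` and `bad_rows` are illustrative only.

```python
import pandas as pd

# Same data as the reproduction above: one stray token among integers.
feature = pd.Series([1, 2, 3, '?', 42, 69])

# errors='coerce' turns anything unparseable into NaN instead of raising.
as_numbers = pd.to_numeric(feature, errors='coerce')

# Entries that were present originally but failed to parse are the offenders,
# for example a '?' placeholder or a header row repeated in the middle of a file.
bad_rows = feature[as_numbers.isna() & feature.notna()]
print(bad_rows)
```

The index of `bad_rows` points at the rows to clean or drop before calling `LabelEncoder`.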

Expected Results

A reasonable error message that addresses the issue in a clear and direct manner. Here is an example of what that would look like:

https://www.kaggle.com/c/titanic/discussion/26976

Error in predict.randomForest(rf, extractFeatures(test)) : missing values in newdata

Because scikit-learn’s algorithms currently only accept numerical input (AFAIK), any non-numerical data should be treated as missing values or otherwise seen as aberrant.

Actual Results

Traceback (most recent call last):
  File "poop.py", line 5, in <module>
    LabelEncoder().fit_transform(feature_with_missing_value)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/preprocessing/label.py", line 112, in fit_transform
    self.classes_, y = np.unique(y, return_inverse=True)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 223, in unique
    return _unique1d(ar, return_index, return_inverse, return_counts)
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/arraysetops.py", line 280, in _unique1d
    perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
TypeError: unorderable types: str() < int()

Versions

Home:

Linux-4.4.0-137-generic-i686-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
NumPy 1.14.2
SciPy 1.0.1
Scikit-Learn 0.19.1

Google Cloud:

System
------
   machine: Linux-4.14.33+-x86_64-with-debian-9.5
    python: 3.5.3 (default, Sep 27 2018, 17:25:39)  [GCC 6.3.0 20170516]
executable: /usr/bin/python3
BLAS
----
  lib_dirs:
    macros:
cblas_libs: cblas
Python deps
-----------
setuptools: 40.6.2
     scipy: 1.1.0
       pip: 9.0.1
    Cython: None
     numpy: 1.15.4
    pandas: 0.23.4
   sklearn: 0.20.0

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 27 (19 by maintainers)

Top GitHub Comments

bmaisonn commented, Jul 25, 2020

I’ve just tried to reproduce this one, but it now returns an appropriate error message:

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']
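To illustrate the kind of check that can produce a message like this, here is a rough sketch of a standalone pre-check. It is an assumption for illustration, not scikit-learn's actual implementation; `check_uniform_types` is a hypothetical helper name and its message wording is made up.

```python
def check_uniform_types(values):
    """Hypothetical pre-check: reject input that mixes strings with other types.

    Returns the sorted list of type names found when the input is acceptable.
    """
    type_names = sorted({type(v).__name__ for v in values})
    # Strings cannot be ordered against numbers in Python 3, so sorting fails.
    if 'str' in type_names and len(type_names) > 1:
        raise TypeError(
            "Input must be uniformly strings or numbers. Got {}".format(type_names)
        )
    return type_names


# Usage: the data from the report fails fast with a readable message.
try:
    check_uniform_types([1, 2, 3, '?', 42, 69])
except TypeError as exc:
    print(exc)
```

Running a check like this before sorting means the user sees which types were mixed, rather than a comparison failure raised deep inside `np.unique`.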

amueller commented, Nov 20, 2018

Thanks for the report. Can confirm the same issue in master. It’s actually unrelated to missing values. We try hard to give good error messages so your feedback is much appreciated. If you can reproduce the other issue, please let us know. Btw, the chance that you need to use a LabelEncoder is pretty slim (which is not an excuse for bad error messages). If it’s actually a label, you don’t need to encode it. If it’s the data, you should be using OneHotEncoder (in version 0.20).
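As a small sketch of the suggestion to use OneHotEncoder for categorical feature data (scikit-learn 0.20 or later), the example below encodes a single string column. The array `X` and its colour values are made up for illustration and do not come from the issue.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature column; the values are invented for this example.
X = np.array([['red'], ['green'], ['red'], ['blue']])

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
# at transform time instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')

# fit_transform returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(X).toarray()

print(encoder.categories_)
print(encoded)
```

Each row of `encoded` has one column per category learned during `fit`, with the categories listed in `encoder.categories_`.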

