
Handle missing values in OneHotEncoder


A minimal implementation might translate a NaN in the input to a row of NaNs in the output. I believe this would be the most consistent default behaviour with respect to other preprocessing tools, and reasonably backwards-compatible, but other core devs might disagree (see https://github.com/scikit-learn/scikit-learn/issues/10465#issuecomment-394439632).

NaN should also be excluded from the categories identified in fit.

A handle_missing parameter might allow NaN in input to be:

  • replaced with a row of NaNs as above
  • replaced with a row of zeros
  • represented with a separate one-hot column

in the output.

A missing_values parameter might allow the user to configure what object is a placeholder for missingness (e.g. NaN, None, etc.).

See #10465 for background

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 23
  • Comments: 17 (14 by maintainers)

Top GitHub Comments

12 reactions
jnothman commented, Sep 5, 2018

Perhaps:

X = [["A"],
     ["B"],
     [NaN],
     ["B"]]

handle_missing='all-missing':

Xt = [[  1,   0],
      [  0,   1],
      [NaN, NaN],
      [  0,   1]]

handle_missing='all-zero':

Xt = [[  1,   0],
      [  0,   1],
      [  0,   0],
      [  0,   1]]

handle_missing='category':

Xt = [[  1,   0,  0],
      [  0,   1,  0],
      [  0,   0,  1],
      [  0,   1,  0]]
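
For concreteness, here is a small NumPy sketch (plain Python, not scikit-learn code) that reproduces these three behaviours on the toy column above; the strategy names simply mirror the proposal and are illustrative only:

import numpy as np

def one_hot_with_missing(x, strategy="all-zero"):
    # Illustrative only: one-hot encode a single column, treating NaN as missing.
    # Categories are learned excluding NaN, as proposed above.
    x = np.asarray(x, dtype=object)
    is_missing = np.array([isinstance(v, float) and np.isnan(v) for v in x])
    categories = sorted(set(x[~is_missing]))
    out = np.zeros((len(x), len(categories)))
    for j, cat in enumerate(categories):
        out[:, j] = (x == cat)
    if strategy == "all-missing":
        out[is_missing, :] = np.nan                                 # row of NaNs
    elif strategy == "category":
        out = np.hstack([out, is_missing[:, None].astype(float)])   # extra indicator column
    # strategy == "all-zero": missing rows are already all zeros
    return out

X = ["A", "B", float("nan"), "B"]
for strategy in ("all-missing", "all-zero", "category"):
    print(strategy)
    print(one_hot_with_missing(X, strategy))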

A good idea might be to start by writing things other than the implementation:

  • docstring
  • tests
  • doc/modules/preprocessing.rst where you could outline the pros and cons of each of these options
3 reactions
ogrisel commented, Jul 18, 2019

I am also +1 for not supporting the option that would generate a row of NaNs; it sounds like YAGNI to me.

Let’s consider the following case: a CSV file with 2 categorical columns, where one uses string labels and the other uses integer labels:

>>> import pandas as pd
>>> from io import StringIO
>>> csv_content = """\
... f1,f2
... "a",0
... ,1
... "b",
... ,
... """
>>> raw_df = pd.read_csv(StringIO(csv_content))
>>> raw_df
    f1   f2
0    a  0.0
1  NaN  1.0
2    b  NaN
3  NaN  NaN
>>> raw_df.dtypes
f1     object
f2    float64
dtype: object

So by default pandas uses the float64 dtype for the integer-valued column, so that it can use NaN as the missing-value marker.

It’s actually possible to use SimpleImputer with the constant strategy on this kind of heterogeneously typed data, as it will convert it to a numpy array with object dtype:

>>> from sklearn.impute import SimpleImputer
>>> imputed = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(raw_df)
>>> imputed
array([['a', 0.0],
       ['missing', 1.0],
       ['b', 'missing'],
       ['missing', 'missing']], dtype=object)

However, putting string values in an otherwise float-valued column is weird and causes the OneHotEncoder to crash on that column:

>>> from sklearn.preprocessing import OneHotEncoder
>>> OneHotEncoder().fit_transform(imputed)
Traceback (most recent call last):
  File "<ipython-input-48-04b9d558c891>", line 1, in <module>
    OneHotEncoder().fit_transform(imputed)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/_encoders.py", line 358, in fit_transform
    return super().fit_transform(X, y)
  File "/home/ogrisel/code/scikit-learn/sklearn/base.py", line 556, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/_encoders.py", line 338, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/_encoders.py", line 86, in _fit
    cats = _encode(Xi)
  File "/home/ogrisel/code/scikit-learn/sklearn/preprocessing/label.py", line 114, in _encode
    raise TypeError("argument must be a string or number")
TypeError: argument must be a string or number

Using the debugger to see the underlying exception reveals:

TypeError: '<' not supported between instances of 'str' and 'float'
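
The error is not specific to the encoder: the categories are obtained by sorting the unique values of the column, and Python refuses to order a str against a float. The same failure can be reproduced in isolation:

>>> sorted([0.0, "missing"])
Traceback (most recent call last):
  ...
TypeError: '<' not supported between instances of 'str' and 'float'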

One could use a ColumnTransformer to split the string-valued categorical columns from the number-valued ones and use a suitable fill_value for constant imputation on each side.
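
A sketch of that workaround, assuming the f1/f2 column names from the example above and arbitrary placeholder fill values:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Impute each kind of categorical column with a fill_value of a matching type,
# so that no column ever mixes strings and floats, then one-hot encode.
preprocessor = ColumnTransformer([
    ("str_cat", make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OneHotEncoder()), ["f1"]),
    ("num_cat", make_pipeline(
        SimpleImputer(strategy="constant", fill_value=-1),
        OneHotEncoder()), ["f2"]),
])

Xt = preprocessor.fit_transform(raw_df)  # raw_df from the snippet above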

However, from a usability standpoint it would make sense to have OneHotEncoder do this constant imputation directly with handle_missing="indicator".

We could also implement the zero strategy with handle_missing="zero". We need to decide about the default. missing_

We also need to make sure that a NaN passed only at transform time (without having been seen in that column at fit time) is accepted (with the zero encoding), so that cross-validation remains possible on data where the few missing values might all end up in the validation split by chance.
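
For comparison, the existing handle_unknown="ignore" option already produces exactly this all-zero encoding for non-missing categories seen only at transform time (sparse=False, as the option was called at the time, is used only to make the printed output dense); NaN itself still raises an error in the then-current releases, which is the gap this issue is about:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore", sparse=False)
enc.fit([["a"], ["b"]])
print(enc.transform([["a"], ["c"]]))  # "c" was never seen at fit time
# [[1. 0.]
#  [0. 0.]]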
