Handle missing values in OneHotEncoder
A minimum implementation might translate a NaN in input to a row of NaNs in output. I believe this would be the most consistent default behaviour with respect to other preprocessing tools, and reasonably backwards-compatible, but other core devs might disagree (see https://github.com/scikit-learn/scikit-learn/issues/10465#issuecomment-394439632).
NaN should also be excluded from the categories identified in `fit`.
A `handle_missing` parameter might allow a NaN in the input to be, in the output (illustrated below):
- replaced with a row of NaNs, as above
- replaced with a row of zeros
- represented with a separate one-hot column
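To make these options concrete, here is a small plain-numpy sketch (not actual library output; the categories and values are made up) of a column with categories 'a' and 'b' and one missing entry:

```python
import numpy as np

# Input column: ['a', 'b', NaN]; categories identified in fit: ['a', 'b'].
# The three proposed encodings of the missing third row:

row_of_nans = np.array([[1.0, 0.0],             # 'a'
                        [0.0, 1.0],             # 'b'
                        [np.nan, np.nan]])      # NaN -> row of NaNs

row_of_zeros = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [0.0, 0.0]])           # NaN -> row of zeros

indicator = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])         # NaN -> its own one-hot column
```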
A `missing_values` parameter might allow the user to configure which object is the placeholder for missingness (e.g. NaN, None, etc.).
See #10465 for background
Top GitHub Comments
Perhaps:
- `handle_missing='all-missing'`
- `handle_missing='all-zero'`
- `handle_missing='category'`

A good idea might be to start by writing things other than the implementation:
I am also +1 for not supporting the option that would generate a row of NaNs; it sounds like YAGNI to me.
Let’s consider the following data case: a CSV file with 2 categorical columns, where one uses string labels and the other uses integer labels:
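A minimal reconstruction of such a file (the column names `color` and `size` are hypothetical; the original snippet is not preserved here):

```python
import io
import pandas as pd

# Hypothetical CSV: 'color' uses string labels, 'size' uses integer
# labels; each column has one missing entry.
csv = io.StringIO(
    "color,size\n"
    "red,1\n"
    "blue,\n"
    ",2\n"
)
df = pd.read_csv(csv)
print(df.dtypes)
# color     object
# size     float64
```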
By default pandas will thus use the float64 dtype for the int-valued column, so as to be able to use NaN as the missing value marker.
It’s actually possible to use `SimpleImputer` with the constant strategy on this kind of heterogeneously typed data, as it will convert it to a numpy array with object dtype. However, putting string values in an otherwise float-valued column is weird and causes the `OneHotEncoder` to crash on that column:
Using the debugger to see the underlying exception reveals:
One could use the `ColumnTransformer` to split the string-valued categorical columns from the number-valued categorical columns and use a suitable `fill_value` for constant imputation on each side (see the sketch after this paragraph). However, from a usability standpoint it would make sense to have `OneHotEncoder` be able to do constant imputation directly with `handle_missing="indicator"`.
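A sketch of that workaround (column names again hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", np.nan],
                   "size": [1.0, np.nan, 2.0]})

# Impute each group of columns with a fill_value matching its dtype,
# then one-hot encode each side separately.
ct = ColumnTransformer([
    ("str_cat", make_pipeline(
        SimpleImputer(strategy="constant", fill_value="missing"),
        OneHotEncoder()), ["color"]),
    ("num_cat", make_pipeline(
        SimpleImputer(strategy="constant", fill_value=-1),
        OneHotEncoder()), ["size"]),
])
Xt = ct.fit_transform(df)  # "missing" and -1 each become their own category
```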
We could also implement the zero strategy with `handle_missing="zero"`. We need to decide about the default. We also need to make sure that a NaN passed only at transform time (without having been seen in this column at fit time) is accepted, with the zero encoding, so that cross-validation is possible on data with just a few missing values that might all end up in the validation split by chance.
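For comparison, the released `OneHotEncoder` already produces an all-zero row for categories unseen at fit time when `handle_unknown='ignore'` is set, which is the behaviour the zero strategy would need for a NaN that only shows up at transform time:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["a"], ["b"]]))

# A value never seen during fit maps to an all-zero row, the same
# output the proposed zero strategy would give transform-time-only NaNs.
print(enc.transform(np.array([["c"]])).toarray())  # [[0. 0.]]
```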