RFC Behavior of OneHotEncoder on binary features
Right now OneHotEncoder expands every binary feature into two features, afaik.

I'm not sure if this is a great/convenient behavior. I think it might be nicer if (optionally?) it'd use a single column - that's particularly natural if the feature was already 0 and 1. Otherwise it either makes the ColumnTransformer that people have to use more complicated, or creates redundant features.
One might argue that one possible cure for that is the option to drop one of the indicator variables. I’m not really sure if that’s what I want, though. In my mind having a base category is more interpretable in the binary case than in the multinomial case.
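For concreteness, here is a minimal sketch of the behavior being discussed, assuming scikit-learn's OneHotEncoder with default settings (this snippet is illustrative, not taken from the issue itself):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single feature that is already encoded as 0/1.
X = np.array([[0], [1], [1], [0]])

enc = OneHotEncoder()                # default: no category is dropped
Xt = enc.fit_transform(X).toarray()  # default output is sparse; densify for printing

print(enc.categories_)  # [array([0, 1])]
print(Xt)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]  -> the single 0/1 column is expanded into two indicator columns
```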
Issue Analytics
- Created: 5 years ago
- Reactions: 4
- Comments: 15 (15 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I realize this issue is closed, but I wanted to offer my opinion here in case it’s useful for future discussions of OneHotEncoder.
I saw this comment above from @jnothman:
Personally, I would advocate for drop=None to continue to be the default. Here's why:

- drop=None is consistent and easy to understand. For example, if you have 1 possible value in the feature, OHE creates 1 column. 2 possible values create 2 columns. 3 possible values create 3 columns. (And so on.) In contrast, drop='if_binary' is less consistent and thus harder to understand: 1 possible value creates 1 column, 2 possible values create 1 column, 3 possible values create 3 columns.
- drop=None is compatible with handle_unknown='ignore', which is a very useful option, whereas drop='if_binary' is incompatible with that option.
- drop=None is consistent with the get_dummies() function in pandas.

Thus for the majority of cases in which you don't need to remove redundant columns, drop=None is a sensible default. For the minority of cases in which you do need to remove redundant columns, drop='first' is already a great option.

Thus for me, drop='if_binary' would be the least desirable default.
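For reference, a small sketch of the column counts compared above; drop='if_binary' assumes scikit-learn >= 0.23, which postdates the original issue:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One binary feature and one three-valued feature.
X = np.array([["a", "x"],
              ["b", "y"],
              ["a", "z"]])

for drop in (None, "if_binary", "first"):
    n_cols = OneHotEncoder(drop=drop).fit_transform(X).shape[1]
    print(f"drop={drop!r}: {n_cols} columns")

# drop=None:        5 columns (2 + 3)
# drop='if_binary': 4 columns (1 + 3) -- only the binary feature loses a column
# drop='first':     3 columns (1 + 2) -- every feature loses its first category
```

Note that the incompatibility between drop and handle_unknown='ignore' mentioned above reflects the releases current at the time of the discussion; later releases relaxed that restriction, so this depends on the scikit-learn version in use.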
@pavopax there are a couple of other threads in which people discuss some useful information about this: dropping a value seems to matter most in regression (both unregularized and regularized). In the first case, it affects the value of the coefficients, which is important for interpretation. In the second case, it can even affect which columns drop out, which can lead to different solutions. Perfect collinearity also plays poorly with some other models, such as Keras neural networks.
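To make the collinearity point concrete, a quick sketch (assuming numpy and scikit-learn are available): with drop=None the two indicator columns of a binary feature sum to a constant, so once an intercept is added the design matrix is rank-deficient, whereas dropping one level restores full column rank.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[0], [1], [1], [0]])  # a single binary feature
intercept = np.ones((4, 1))

full = OneHotEncoder(drop=None).fit_transform(X).toarray()        # two indicator columns
dropped = OneHotEncoder(drop="first").fit_transform(X).toarray()  # one indicator column

# Both indicators plus the intercept span only a 2-dimensional space,
# so the 3-column design matrix is rank-deficient (perfect collinearity).
print(np.linalg.matrix_rank(np.hstack([intercept, full])))     # 2, but 3 columns
print(np.linalg.matrix_rank(np.hstack([intercept, dropped])))  # 2, with 2 columns: full rank
```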