RFC Behavior of OneHotEncoder on binary features

See original GitHub issue

Right now, OneHotEncoder expands every binary feature into two features, as far as I know. I’m not sure this is great or convenient behavior. I think it might be nicer if it (optionally?) used a single column, which is particularly natural if the feature was already 0 and 1. Otherwise, people either have to build a more complicated ColumnTransformer or end up with redundant features. One might argue that one possible cure is the option to drop one of the indicator variables. I’m not really sure that’s what I want, though. In my mind, having a base category is more interpretable in the binary case than in the multinomial case.
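To make the trade-off concrete, here is a minimal sketch of the two situations described above. The toy array is made up for illustration, and the transformer names `"onehot"` and `"binary"` are arbitrary labels, not anything from the issue.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): column 0 has three categories coded 0/1/2,
# column 1 is already a 0/1 binary feature.
X = np.array([[0, 0], [1, 1], [2, 0], [1, 1]])

# Default behavior: the binary column is expanded into two indicator columns.
n_cols_default = OneHotEncoder().fit_transform(X).shape[1]

# Workaround: encode only column 0 and pass the binary column through unchanged.
workaround = ColumnTransformer([
    ("onehot", OneHotEncoder(), [0]),
    ("binary", "passthrough", [1]),
])
n_cols_workaround = workaround.fit_transform(X).shape[1]

print(n_cols_default, n_cols_workaround)
```

The default encoder should produce 3 + 2 columns, while the ColumnTransformer variant should produce 3 + 1, at the cost of having to list the binary columns by hand.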

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 4
  • Comments:15 (15 by maintainers)

Top GitHub Comments

1 reaction
justmarkham commented, Apr 2, 2020

I realize this issue is closed, but I wanted to offer my opinion here in case it’s useful for future discussions of OneHotEncoder.

I saw this comment above from @jnothman:

I’d be happy to add drop=‘binary’ and eventually change the default. But I’m okay with having drop=‘first’ merged as a first go.

Personally, I would advocate for drop=None to continue to be the default. Here’s why:

  1. drop=None is consistent and easy to understand. For example, if you have 1 possible value in the feature, OHE creates 1 column. 2 possible values creates 2 columns. 3 possible values creates 3 columns. (And so on.) In contrast, drop='if_binary' is less consistent and thus harder to understand. If you have 1 possible value in the feature, OHE creates 1 column. 2 possible values creates 1 column. 3 possible values creates 3 columns.
  2. drop=None is compatible with handle_unknown='ignore', which is a very useful option, whereas drop='if_binary' is incompatible with that option.
  3. drop=None is consistent with the get_dummies() function in pandas.
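The column counts in point 1 and the `handle_unknown='ignore'` behavior in point 2 can be checked with a small sketch. It assumes scikit-learn 0.23 or later (where `drop='if_binary'` exists); the toy data and the helper `columns_per_feature` are hypothetical and only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up): three features with 1, 2, and 3 distinct values.
X = np.array([
    ["a", "no", "red"],
    ["a", "yes", "green"],
    ["a", "no", "blue"],
], dtype=object)

def columns_per_feature(drop):
    """Encode each feature on its own and report how many columns it yields."""
    return [
        OneHotEncoder(drop=drop).fit_transform(X[:, [i]]).shape[1]
        for i in range(X.shape[1])
    ]

counts_none = columns_per_feature(None)
counts_if_binary = columns_per_feature("if_binary")
print(counts_none)       # [1, 2, 3]
print(counts_if_binary)  # [1, 1, 3]

# With drop=None, an unseen category ("maybe") can be ignored:
# its feature is encoded as all zeros instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore").fit(X)
new_row = np.array([["a", "maybe", "red"]], dtype=object)
unknown_row = enc.transform(new_row).toarray()[0]
print(unknown_row)
```

Whether `handle_unknown='ignore'` can be combined with a `drop` setting depends on the scikit-learn version, so this sketch only exercises it with `drop=None`.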

Thus for the majority of cases in which you don’t need to remove redundant columns, drop=None is a sensible default.

For the minority of cases in which you do need to remove redundant columns, drop='first' is already a great option.

Thus for me, drop='if_binary' would be the least desirable default.

1 reaction
drewmjohnston commented, Feb 23, 2019

@pavopax there are a couple of other threads in which people discuss some useful information about this. Dropping a value seems to matter most in regression (both unregularized and regularized). In the first case, it affects the value of the coefficients, which is important for interpretation. In the second case, it can even affect which columns drop out, which can lead to different solutions. Perfect collinearity also plays poorly with some other models, such as Keras neural networks.
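The perfect collinearity mentioned here can be seen directly in the design matrix. The following is a minimal sketch with made-up data, using plain NumPy rather than any particular regression library: with an intercept, the two indicator columns of a binary feature sum to the intercept column, so the matrix loses rank.

```python
import numpy as np

# Made-up binary feature and its full one-hot encoding (two indicator columns).
feature = np.array([0, 1, 1, 0, 1])
one_hot = np.column_stack([feature == 0, feature == 1]).astype(float)
intercept = np.ones((len(feature), 1))

# Intercept plus both indicators: 3 columns, but they are linearly dependent.
full = np.hstack([intercept, one_hot])
# Intercept plus a single indicator: 2 independent columns.
reduced = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 3 columns, rank 2
print(reduced.shape[1], rank_reduced)  # 2 columns, rank 2
```

In the rank-deficient case an unregularized least-squares fit has no unique coefficient vector, which is why dropping one indicator matters for interpretation.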
