One Hot Encoder: Drop one redundant feature by default for features with two categories
Our one-hot encoder creates a feature for every level of the original categorical feature:
from evalml.pipelines import OneHotEncoder
import pandas as pd
df = pd.DataFrame({"category": ["a", "b"], "number": [4, 5]})
OneHotEncoder().fit_transform(df).to_dataframe()
The category_a and category_b columns are perfectly collinear (each is always one minus the other), which makes one of them redundant. This could adversely affect estimator fitting. I think we should drop one by default.
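To make the redundancy concrete, here is a minimal sketch that uses plain pandas and NumPy rather than the evalml OneHotEncoder. The extra rows and the names `full`, `design`, `rank_full`, and `reduced` are illustrative assumptions, and `pd.get_dummies(..., drop_first=True)` only stands in for whatever default the encoder might adopt.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the issue, with two extra rows (an assumption)
# so that the rank check below is meaningful.
df = pd.DataFrame({"category": ["a", "b", "a", "b"], "number": [4, 5, 6, 7]})

# Full one-hot encoding: one indicator column per level.
full = pd.get_dummies(df, columns=["category"], dtype=int)

# The two indicators always sum to 1, so an intercept column plus both
# indicators gives a rank-deficient design matrix.
assert (full["category_a"] + full["category_b"] == 1).all()
design = np.column_stack(
    [np.ones(len(full)), full["category_a"], full["category_b"]]
)
rank_full = np.linalg.matrix_rank(design)  # 3 columns, but rank is lower

# Dropping one level removes the redundancy without losing information.
reduced = pd.get_dummies(df, columns=["category"], drop_first=True, dtype=int)
print(rank_full, list(reduced.columns))
```

With a binary feature, the single remaining indicator still identifies both categories, which is why dropping one column loses nothing.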
FYI @rpeck
Issue Analytics
- Created 3 years ago
- Comments: 14 (7 by maintainers)
Top GitHub Comments
Third Law of Code: Thou Shalt Not Make == Comparisons With Floats
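As a small illustration of that rule, the sketch below compares a float sum with `==` and with a tolerance-based check. It uses only the standard library's `math.isclose`; the variable names are illustrative and not taken from the thread.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary
# floating-point representation.
naive_equal = total == 0.3

# A tolerance-based comparison treats the two values as equal.
tolerant_equal = math.isclose(total, 0.3)

print(naive_equal, tolerant_equal)
```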
Post-discussion with @freddyaboulton @rpeck @dsherry @chukarsten @jeremyliweishih