One Hot Encoder: Drop one redundant feature by default for features with two categories

Our one hot encoder creates a feature for every level of the original categorical feature:

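from evalml.pipelines import OneHotEncoder
import pandas as pd

# One categorical feature with two levels, plus a numeric feature
df = pd.DataFrame({"category": ["a", "b"], "number": [4, 5]})
# Each level of "category" gets its own indicator column
OneHotEncoder().fit_transform(df).to_dataframe()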

[Image: encoder output, including the category_a and category_b columns]

The category_a and category_b columns are perfectly collinear (each is always one minus the other), which makes one of them redundant. This could have adverse effects on estimator fitting. I think we should drop one by default.
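As a rough illustration of the redundancy described above, here is a minimal sketch using plain pandas and NumPy rather than evalml. The extra rows and the intercept column are assumptions made for the example: the rank check only shows the problem for estimators that fit an intercept.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the issue, with a few more rows (illustrative values).
df = pd.DataFrame({"category": ["a", "b", "a", "b"], "number": [4, 5, 6, 8]})

# One indicator column per level: number, category_a, category_b.
full = pd.get_dummies(df, columns=["category"], dtype=int)

# The two indicators always sum to 1, so each one determines the other.
always_one = bool((full["category_a"] + full["category_b"] == 1).all())

# With an intercept column, the design matrix loses a rank:
# intercept == category_a + category_b.
design = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(design)

# Dropping one level restores full column rank without losing information.
reduced = pd.get_dummies(df, columns=["category"], drop_first=True, dtype=int)
design_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(design.shape[1], rank_full)
print(design_reduced.shape[1], rank_reduced)
```

The first design matrix has more columns than its rank, while the reduced one has full column rank, which is the case the proposal is after.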

FYI @rpeck

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

3 reactions
rpeck commented, Mar 5, 2021

Third Law of Code: Thou Shalt Not Make == Comparisons With Floats
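For readers unfamiliar with the rule quoted above, a small sketch of why exact float comparison is fragile, using only the Python standard library. The values are illustrative and do not come from the issue thread.

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 have no exact binary representation.
exact_match = total == 0.3

# A tolerance-based comparison is usually what is intended.
close_match = math.isclose(total, 0.3)

print(exact_match, close_match)
```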

2 reactions
angela97lin commented, Mar 16, 2021

Post-discussion with @freddyaboulton @rpeck @dsherry @chukarsten @jeremyliweishih

  • We will only do this for binary cases.
  • A “nice-to-have” in the binary case is to use the minority class, but otherwise just choosing one of the two categories should suffice.
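
The two points above could be sketched roughly as follows in plain pandas. This is not evalml's implementation: `one_hot_drop_binary` is a hypothetical helper, and reading "use the minority class" as "keep the indicator for the minority class" is an assumption about what the comment intends.

```python
import pandas as pd


def one_hot_drop_binary(df, columns):
    """Hypothetical helper: one-hot encode `columns`, dropping one level
    only when a column has exactly two categories."""
    parts = [df.drop(columns=columns)]
    for col in columns:
        dummies = pd.get_dummies(df[col], prefix=col, dtype=int)
        counts = df[col].value_counts()
        if len(counts) == 2:
            # Assumption: keep the indicator for the less frequent category.
            minority = counts.idxmin()
            dummies = dummies[[f"{col}_{minority}"]]
        parts.append(dummies)
    return pd.concat(parts, axis=1)


df = pd.DataFrame(
    {"binary": ["a", "a", "b"], "multi": ["x", "y", "z"], "number": [1, 2, 3]}
)
encoded = one_hot_drop_binary(df, ["binary", "multi"])
print(encoded.columns.tolist())
```

Only the two-level column is reduced to a single indicator; the three-level column keeps all of its indicators.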

Top Results From Across the Web

OneHotEncoder — 1.5.2 - Feature-engine
By default, the OneHotEncoder() will return both binary variables from “Gender”: “female” and “male”. When a categorical variable has only 2 categories, like...

Drop the first category from binary features (only ... - YouTube
... drop='if_binary' with OneHotEncoder to drop the first category ONLY if it's a binary feature (meaning it has exactly two categories)...

Dropping one of the columns when using one-hot encoding
Recently someone pointed out that when you do one-hot encoding on a categorical variable you end up with correlated features, so you should...

sklearn.preprocessing.OneHotEncoder
Encode categorical features as a one-hot numeric array. ... By default, the encoder derives the categories based on the unique values in each...

Ordinal and One-Hot Encodings for Categorical Data
The one-hot encoding creates one binary variable for each category. The problem is that this representation includes redundancy. For example, if ...
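
For comparison, the drop='if_binary' option mentioned in the results above belongs to scikit-learn's OneHotEncoder, not evalml's. A minimal sketch, assuming a scikit-learn version that supports that option; the input array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# First column has two categories, second column has three (illustrative data).
X = np.array([["a", "x"], ["b", "y"], ["a", "z"]], dtype=object)

# drop="if_binary" drops the first category only for two-category features.
encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(X).toarray()

# One column for the binary feature plus three for the other feature.
print(encoded.shape)
```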
