One Hot Encoder: Drop one redundant feature by default for features with two categories

Our one hot encoder creates a feature for every level of the original categorical feature:

from evalml.pipelines import OneHotEncoder
import pandas as pd
df = pd.DataFrame({"category": ["a", "b"], "number": [4, 5]})
OneHotEncoder().fit_transform(df).to_dataframe()

The category_a and category_b columns are completely collinear, which makes one redundant. This could have adverse effects on estimator fitting. I think we should drop one by default.
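To make the redundancy concrete, here is a minimal sketch that uses plain pandas `get_dummies` rather than evalml's `OneHotEncoder` (an assumption made so the example runs without evalml). It shows that the two indicator columns always sum to 1, and that `drop_first=True` leaves a single indicator for the two-level feature.

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "b"], "number": [4, 5]})

# Plain pandas one-hot encoding (not evalml): one indicator column per level.
full = pd.get_dummies(df, columns=["category"], dtype=int)

# The two indicators always sum to 1, so either one determines the other.
row_sums = full["category_a"] + full["category_b"]

# drop_first=True keeps a single indicator for the two-level feature.
reduced = pd.get_dummies(df, columns=["category"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

Note that `drop_first=True` drops a level from every encoded feature, not only binary ones, so it is only a stand-in for the behavior proposed here.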

FYI @rpeck

Issue Analytics

Created 3 years ago
Comments: 14 (7 by maintainers)

Top GitHub Comments

3 reactions

rpeck commented, Mar 5, 2021

Third Law of Code: Thou Shalt Not Make == Comparisons With Floats
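As a small illustration of the rule (a generic standard-library sketch, not code from the evalml repository), exact equality on floats can fail where a tolerance-based comparison succeeds:

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because 0.1 and 0.2 are not exactly representable.
exact_match = total == 0.3

# A tolerance-based comparison treats the two values as equal.
close_match = math.isclose(total, 0.3)

print(exact_match, close_match)
```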

2 reactions

angela97lin commented, Mar 16, 2021

Post-discussion with @freddyaboulton @rpeck @dsherry @chukarsten @jeremyliweishih

We will only do this for binary cases.
A “nice-to-have” in the binary case is to use the minority class, but otherwise just choosing one of the two categories should suffice.
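The plan above could look roughly like the following library-free sketch. The function name `one_hot_drop_if_binary` is hypothetical and this is not evalml's implementation. It also assumes one particular reading of the minority-class preference: for a binary feature, the indicator that is kept is the one for the less frequent category.

```python
from collections import Counter


def one_hot_drop_if_binary(values, prefix):
    """One-hot encode `values`; emit a single column when there are exactly two categories."""
    counts = Counter(values)
    categories = sorted(counts)
    if len(categories) == 2:
        # Assumption: keep the indicator for the less frequent (minority) category.
        # Ties fall back to the first category in sorted order.
        kept = min(categories, key=lambda c: (counts[c], c))
        categories = [kept]
    # Features with any other number of categories keep one column per category.
    return {f"{prefix}_{c}": [int(v == c) for v in values] for c in categories}


binary = one_hot_drop_if_binary(["a", "b", "a"], "category")
multi = one_hot_drop_if_binary(["x", "y", "z"], "color")
print(binary)
print(multi)
```

Under the opposite reading (drop the minority class instead), only the `min` would change to `max`; either way a binary feature ends up with one column.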

Top Results From Across the Web

OneHotEncoder — 1.5.2 - Feature-engine

By default, the OneHotEncoder() will return both binary variables from “Gender”: “female” and “male”. When a categorical variable has only 2 categories, like...

Drop the first category from binary features (only ... - YouTube

... drop='if_binary' with OneHotEncoder to drop the first category ONLY if it's a binary feature (meaning it has exactly two categories)...

Dropping one of the columns when using one-hot encoding

Recently someone pointed out that when you do one-hot encoding on a categorical variable you end up with correlated features, so you should...

sklearn.preprocessing.OneHotEncoder

Encode categorical features as a one-hot numeric array. ... By default, the encoder derives the categories based on the unique values in each...

Ordinal and One-Hot Encodings for Categorical Data

The one-hot encoding creates one binary variable for each category. The problem is that this representation includes redundancy. For example, if ...