Include drop='last' to OneHotEncoder

Describe the workflow you want to enable

When using SimpleImputer + OneHotEncoder, I am able to add a new constant category for NaN values like the example below:

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, [0])
    ])

df = pd.DataFrame(['Male', 'Female', np.nan])
preprocessor.fit_transform(df)

# array([[0., 1., 0.],
#        [1., 0., 0.],
#        [0., 0., 1.]])

However, I wanted to have an argument like OneHotEncoder(drop='last') in order to have an output like:

array([[0., 1.],
       [1., 0.],
       [0., 0.]])

This would encode every NaN as an all-zero row instead of giving it a column of its own.

Describe your proposed solution

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('encoder', OneHotEncoder(drop='last'))])

Describe alternatives you’ve considered, if relevant

There’s no good alternative that stays compatible with sklearn’s pipelines. I was following issue #11996, which proposed adding a handle_missing option to OneHotEncoder, but it was set aside in favor of using a “constant” imputation strategy on the categorical columns. However, the constant strategy adds an extra column that could be dropped in this scenario.

Additional context

No response

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
lorentzenchr commented, Jun 5, 2022

IIRC, R's formula drops the first level by default. That corresponds to our option "first". Having "first", why not have an option "last"? Edit: maybe not worth it.

More striking is the argument to have "most_frequent". This is a strategy that I've seen in other GLM software. Model coefficients are then the effect relative to this most frequent level, which seems the most natural choice (but is irrelevant for most other things).

We could also consider supporting sample weights with “most_frequent”, i.e. choosing the level with the highest sum of sample weights.
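As a rough illustration of the weighted "most_frequent" idea, the level to drop can be computed outside the encoder and passed through the existing array form of `drop`. This is only a sketch: `heaviest_level`, the `color` column, and the weights are made up for the example, and nothing here is an existing OneHotEncoder option.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data and weights, invented for this example.
X = pd.DataFrame({'color': ['red', 'blue', 'blue', 'green']})
sample_weight = np.array([5.0, 1.0, 1.0, 0.5])


def heaviest_level(column, weights):
    """Hypothetical helper: return the level with the largest sum of weights."""
    totals = pd.Series(weights).groupby(column.to_numpy()).sum()
    return totals.idxmax()


# One level to drop per feature, as the array form of `drop` expects.
drop = [heaviest_level(X[col], sample_weight) for col in X.columns]

ohe = OneHotEncoder(drop=drop).fit(X)
encoded = ohe.transform(X).toarray()
```

Here 'blue' is the most frequent level by count, but 'red' carries the largest total weight, so 'red' becomes the all-zero reference level.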

To the best of my knowledge, for linear models with penalties, one should never drop a level. For unpenalized linear models, the converse is true: always drop one level (otherwise one has perfect collinearity and solvers have a much harder job finding the unique minimum norm solution).
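The collinearity point can be checked numerically. The sketch below uses a toy design matrix (invented for this example) with an intercept column and compares its rank with and without a dropped level.

```python
import numpy as np

# Six samples spread over three levels, fully one-hot encoded.
levels = np.array([0, 1, 2, 0, 1, 2])
full = np.eye(3)[levels]
intercept = np.ones((len(levels), 1))

# Intercept plus all three indicator columns: the indicators sum to the
# intercept column, so the columns are linearly dependent.
X_full = np.hstack([intercept, full])
# Intercept plus two indicators (last level dropped).
X_drop = np.hstack([intercept, full[:, :-1]])

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)

print(X_full.shape[1], rank_full)  # number of columns vs. rank
print(X_drop.shape[1], rank_drop)
```

With all levels kept, the matrix has four columns but rank three. After dropping one level, the rank equals the number of columns, which is the situation an unpenalized solver wants.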

1 reaction
lesteve commented, May 25, 2022

OneHotEncoder has a drop argument to which you can pass an array (one category to drop for each feature). Can you use this for your use case? Adapting your snippet, something like this:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['Male', 'Female', np.nan])
ohe = OneHotEncoder(drop=[np.nan])
ohe.fit_transform(df).toarray()

Output:

array([[0., 1.],
       [1., 0.],
       [0., 0.]])
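The same idea can probably be kept inside the original imputer-plus-encoder pipeline by dropping the imputer's fill value instead of NaN. This is a sketch, not something confirmed in the thread: it assumes the placeholder is named explicitly through `fill_value` (the name `FILL` is just a local variable) and that at least one value is missing in the training data, since the encoder is expected to reject a `drop` category it never saw during fit.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(['Male', 'Female', np.nan])

# Name the placeholder explicitly so the encoder can be told to drop it.
FILL = 'missing_value'

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=FILL)),
    ('encoder', OneHotEncoder(drop=[FILL])),  # one entry per feature
])

# The encoder returns a sparse matrix by default; densify it for display.
encoded = categorical_transformer.fit_transform(df).toarray()
print(encoded)
```

Rows that were imputed come out as all zeros, which matches the output requested in the issue without adding a drop='last' option.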