PolynomialFeatures always generates all combinations with degree less than the degree parameter

Problem

Currently, PolynomialFeatures states:

“Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.”

In practice, we often need only the combinations of the features with degree strictly equal to the specified degree.

For instance, say we set degree=2 and interaction_only=True, and we have 3 input features: A, B, C. PolynomialFeatures will generate 7 features: 1, A, B, C, AB, AC, BC. But we may want to generate only AB, AC, BC.
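The count in this example can be checked with a small standard-library sketch. It enumerates terms from degree 0 up to the requested degree without repeating a feature, which is how the issue describes interaction_only=True. The helper name interaction_terms is made up for illustration and is not part of scikit-learn.

```python
from itertools import combinations

def interaction_terms(names, degree, include_bias=True):
    """Hypothetical helper: list interaction-only terms up to `degree`."""
    start = 0 if include_bias else 1
    terms = []
    for d in range(start, degree + 1):
        for combo in combinations(names, d):
            # The empty combination is the bias column, written "1".
            terms.append("".join(combo) if combo else "1")
    return terms

all_terms = interaction_terms(["A", "B", "C"], degree=2)
print(all_terms)  # ['1', 'A', 'B', 'C', 'AB', 'AC', 'BC']

# What the issue asks for: only the terms of degree exactly 2.
# (Counting characters works here because every name is one letter.)
print([t for t in all_terms if len(t) == 2])  # ['AB', 'AC', 'BC']
```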

Background

In my experience, PolynomialFeatures isn’t flexible enough to be useful in many practical cases.

Related issue: A few days ago, another issue was opened to suggest similar changes. It suggests adding a combinations parameter that would let the user manually specify the combinations used to generate the polynomial features. In my opinion, although that idea is more powerful than what I suggest here, generating the combinations by hand could often be tedious for the user. So, if the combinations parameter is added, I believe it would be great to also add a helper function that generates the combinations.

Proposed solution

Flexible solution

Let the user pass a list as the degree parameter. For instance, the user could call PolynomialFeatures(degree=[2, 3], interaction_only=True) to generate the combinations of the features with degree strictly equal to 2 or 3.

The problem I see with this idea is that passing degree=[2] would not have the same behaviour as the current behaviour when we use degree=2. So, it might be confusing for the user. Perhaps we could add a degrees parameter and leave degree unchanged? I’m open to ideas regarding this small issue.
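A rough sketch of what the list-based idea could enumerate, using only itertools. The function name combinations_for_degrees and its degrees argument are assumptions made for illustration; nothing here is existing scikit-learn API.

```python
from itertools import chain, combinations, combinations_with_replacement

def combinations_for_degrees(n_features, degrees, interaction_only=False):
    """Hypothetical sketch: feature-index tuples for exactly the listed degrees."""
    comb = combinations if interaction_only else combinations_with_replacement
    return chain.from_iterable(
        comb(range(n_features), d) for d in sorted(set(degrees))
    )

# Three features (indices 0, 1, 2), degrees strictly equal to 2 or 3.
combos = list(combinations_for_degrees(3, degrees=[2, 3], interaction_only=True))
print(combos)  # [(0, 1), (0, 2), (1, 2), (0, 1, 2)]
```

With this reading, degrees=[2] yields only the three pairs, whereas today's degree=2 also yields the bias and the original features, which is the inconsistency mentioned above.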

Minimalist solution

Currently, we call PolynomialFeatures._combinations, which generates the combinations starting from:

  • degree 0 if include_bias = True
  • and degree 1 if include_bias = False.

Instead of include_bias, we could use a new min_degree parameter, which would be an integer less than or equal to degree.

That is:

def _combinations(n_features, degree, interaction_only, include_bias):
    ...
    start = int(not include_bias)
    ...

would become something like:

def _combinations(n_features, degree, interaction_only, min_degree=0):
    ...
    start = min_degree
    ...

So, to solve the example given in the “Problem” paragraph, we would simply call:

PolynomialFeatures(degree=2, interaction_only=True, min_degree=2)
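To make the min_degree proposal concrete, here is a standalone sketch of how such an enumeration could behave. It is not the scikit-learn source: the function name combinations_from_min_degree, the validation, and the body are assumptions for illustration only.

```python
from itertools import chain, combinations, combinations_with_replacement

def combinations_from_min_degree(n_features, degree, interaction_only, min_degree=0):
    """Standalone sketch of the proposal; not the scikit-learn implementation."""
    if not 0 <= min_degree <= degree:
        raise ValueError("min_degree must satisfy 0 <= min_degree <= degree")
    comb = combinations if interaction_only else combinations_with_replacement
    return chain.from_iterable(
        comb(range(n_features), d) for d in range(min_degree, degree + 1)
    )

# Features A, B, C are indices 0, 1, 2; only AB, AC, BC are produced.
pairs = list(combinations_from_min_degree(3, degree=2, interaction_only=True, min_degree=2))
print(pairs)  # [(0, 1), (0, 2), (1, 2)]
```

In this sketch, min_degree=0 plays the role of include_bias=True (the empty tuple is the bias column) and min_degree=1 plays the role of include_bias=False.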

Alternatives

To achieve the same goal, I see two possible options:

  • We could filter the output columns to remove the generated features that we don’t need, but this is a waste of memory and computing power.
  • We could override PolynomialFeatures locally to change the behaviour of _combinations, or implement the custom preprocessor from scratch.
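The first alternative, filtering after the fact, might look like the sketch below. It uses plain Python lists in place of a real transformed matrix, and it assumes the output columns are ordered by increasing degree with the bias first, as in the example from the “Problem” section.

```python
from itertools import chain, combinations

n_features, degree = 3, 2

# Output columns in order of increasing degree, bias column first.
columns = list(chain.from_iterable(
    combinations(range(n_features), d) for d in range(0, degree + 1)
))

# Indices of the columns whose degree is exactly `degree`.
keep = [i for i, combo in enumerate(columns) if len(combo) == degree]
print(keep)  # [4, 5, 6]

# Toy transformed row for A=2, B=3, C=5: [1, A, B, C, AB, AC, BC].
row = [1, 2, 3, 5, 6, 10, 15]
print([row[i] for i in keep])  # [6, 10, 15]
```

All seven columns are computed before four of them are thrown away, which is the waste the first bullet refers to. With a fitted PolynomialFeatures, the powers_ attribute could serve the same purpose as columns here, since the sum of each row of exponents gives that output feature's degree.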

I hope that my explanations make sense. Thank you for your time,

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

3 reactions
ogrisel commented, May 25, 2021

To clarify the current behavior of interaction_only=True:

>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = pd.DataFrame(np.random.randn(100, 3), columns=list("abc"))
>>> PolynomialFeatures(degree=3, interaction_only=True).fit(X).get_feature_names(X.columns)
['1', 'a', 'b', 'c', 'a b', 'a c', 'b c', 'a b c']
>>> PolynomialFeatures(degree=3, interaction_only=False).fit(X).get_feature_names(X.columns)
['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2', 'a^3', 'a^2 b', 'a^2 c', 'a b^2', 'a b c', 'a c^2', 'b^3', 'b^2 c', 'b c^2', 'c^3']

I find that this behavior is what I often want by default (to add interactions, for instance after a SplineTransformer step in a pipeline), but I agree the name of the option is a bit confusing.

Maybe we could have specific options to:

  • include or exclude the bias term
  • include or exclude the original features
  • include or exclude interaction terms (terms involving more than one feature)

and then deprecate the interaction_only option.

2 reactions
NicolasHug commented, May 25, 2021

“I think deprecating degree and replacing it with max_degree would be acceptable.”

Instead of deprecating degree, could we just extend it to also accept a tuple, i.e. degree=(min_degree, max_degree)?
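One way to read the tuple suggestion is that a plain integer keeps its current meaning while a pair sets both bounds. The sketch below shows that interpretation only; the helper name normalize_degree and its validation rule are hypothetical.

```python
def normalize_degree(degree):
    """Hypothetical helper: return (min_degree, max_degree) from an int or a pair."""
    if isinstance(degree, int):
        # A plain integer keeps today's meaning: everything from degree 0 up.
        return 0, degree
    min_degree, max_degree = degree
    if not 0 <= min_degree <= max_degree:
        raise ValueError("expected 0 <= min_degree <= max_degree")
    return min_degree, max_degree

print(normalize_degree(2))       # (0, 2)
print(normalize_degree((2, 3)))  # (2, 3)
```

Under this reading, existing calls such as degree=2 would behave as before, so nothing would need to be deprecated.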
