PolynomialFeatures always generates all combinations with degree less than or equal to the degree parameter
Problem
Currently, PolynomialFeatures states:
Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.
The problem with that is that we often only need the combinations of the features with degree strictly equal to the specified degree.
For instance:
- Let’s say that we set interaction_only=True.
- Let’s say that we have 3 input features: A, B, C.
- PolynomialFeatures will generate 7 features: 1, A, B, C, AB, AC, BC.
- But we may want to generate only AB, AC, BC.
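The enumeration in this example can be reproduced with a few lines of standard-library Python. This is only an illustrative sketch: it assumes the terms are enumerated with itertools.combinations, and the variable names are made up for the example rather than taken from scikit-learn.

```python
from itertools import chain, combinations

features = ["A", "B", "C"]
max_degree = 2

# interaction_only=True: each feature appears at most once per term.
# Every degree from 0 (the bias term "1") up to max_degree is emitted.
all_terms = list(
    chain.from_iterable(
        combinations(features, d) for d in range(0, max_degree + 1)
    )
)
labels = ["".join(term) if term else "1" for term in all_terms]
print(labels)  # ['1', 'A', 'B', 'C', 'AB', 'AC', 'BC']

# What the issue asks for: only the terms of degree exactly max_degree.
wanted = [
    label for label, term in zip(labels, all_terms) if len(term) == max_degree
]
print(wanted)  # ['AB', 'AC', 'BC']
```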
Background
According to my humble experience, PolynomialFeatures isn’t flexible enough to be useful in many practical cases.
Related issue:
A few days ago, this issue was opened to suggest similar changes. It suggests adding a combinations parameter that would allow the user to manually specify the combinations used to generate the polynomial features.
In my opinion, although this idea is more powerful than what I suggest in this issue, in many cases it might be tedious for the user to generate the combinations.
So, I believe that if the combinations parameter is added, it would be great to add a helper function to facilitate the generation of the combinations.
Proposed solution
Flexible solution
Let the user pass a list as the degree parameter. For instance, the user could write PolynomialFeatures(degree=[2, 3], interaction_only=True) to generate the combinations of the features with degree strictly equal to 2 or 3.
The problem I see with this idea is that passing degree=[2] would not behave the same way as degree=2 does today, so it might be confusing for the user. Perhaps we could add a degrees parameter and leave degree unchanged? I’m open to ideas regarding this small issue.
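To make the list-based idea concrete, here is a rough sketch of how the terms could be enumerated for an explicit list of degrees. The function name combinations_for_degrees and its signature are hypothetical, invented for this illustration; nothing like it is claimed to exist in scikit-learn.

```python
from itertools import chain, combinations, combinations_with_replacement


def combinations_for_degrees(n_features, degrees, interaction_only=False):
    """Hypothetical helper: yield feature-index tuples for terms whose
    degree is exactly one of the values in `degrees`."""
    comb = combinations if interaction_only else combinations_with_replacement
    return chain.from_iterable(
        comb(range(n_features), d) for d in sorted(set(degrees))
    )


# Degree strictly equal to 2 or 3, interactions only, 3 input features.
terms = list(combinations_for_degrees(3, [2, 3], interaction_only=True))
print(terms)  # [(0, 1), (0, 2), (1, 2), (0, 1, 2)]

# degree=[2] yields only the 3 pairwise terms, whereas degree=2 currently
# yields 7 terms, which is the source of the possible confusion.
only_pairs = list(combinations_for_degrees(3, [2], interaction_only=True))
print(len(only_pairs))  # 3
```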
Minimalist solution
Currently, we call PolynomialFeatures._combinations, which generates the combinations starting from:
- degree 0 if include_bias = True
- degree 1 if include_bias = False
Instead of include_bias, we could use a new min_degree parameter, which would be an integer less than or equal to degree.
i.e.:
    def _combinations(n_features, degree, interaction_only, include_bias):
        ...
        start = int(not include_bias)
        ...
would become something like:
    def _combinations(n_features, degree, interaction_only, min_degree=0):
        ...
        start = min_degree
        ...
So, to solve the example given in the “Problem” paragraph, we would simply call:
PolynomialFeatures(degree=2, interaction_only=True, min_degree=2)
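As a standalone illustration of the min_degree idea, the sketch below fills in a plausible body for such a function. The itertools-based body is an assumption made for this example, not a copy of the library's code, and min_degree is the parameter proposed in this issue rather than an existing one.

```python
from itertools import chain, combinations, combinations_with_replacement


def _combinations(n_features, degree, interaction_only, min_degree=0):
    # Assumed enumeration strategy: one itertools call per degree,
    # starting at min_degree instead of a value derived from include_bias.
    comb = combinations if interaction_only else combinations_with_replacement
    return chain.from_iterable(
        comb(range(n_features), d) for d in range(min_degree, degree + 1)
    )


# Only the pairwise interactions AB, AC, BC (as feature-index tuples).
pairs = list(_combinations(3, degree=2, interaction_only=True, min_degree=2))
print(pairs)  # [(0, 1), (0, 2), (1, 2)]

# min_degree=0 and min_degree=1 mirror include_bias=True / include_bias=False.
with_bias = list(_combinations(3, 2, True, min_degree=0))
without_bias = list(_combinations(3, 2, True, min_degree=1))
print(len(with_bias), len(without_bias))  # 7 6
```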
Alternatives
To achieve the same goal, I see two possible options:
- We could filter the output columns to remove the generated features that we don’t need, but this is a waste of memory and computing power.
- We could override PolynomialFeatures locally to change the behaviour of _combinations, or implement a custom preprocessor from scratch.
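The first alternative, filtering after the fact, might look like the following pure-Python sketch. It uses plain lists instead of the real transformer, so the data and variable names are illustrative assumptions; the point is only that all 7 columns are computed per row before 4 of them are thrown away.

```python
from itertools import chain, combinations
from math import prod

# Two samples with three features each (made-up values).
X = [[2.0, 3.0, 5.0],
     [1.0, 4.0, 6.0]]
n_features, degree = 3, 2

# All interaction terms of degree 0..2, as tuples of feature indices.
terms = list(
    chain.from_iterable(
        combinations(range(n_features), d) for d in range(degree + 1)
    )
)

# Step 1: compute every output column (7 values per row).
full = [[prod(row[i] for i in term) for term in terms] for row in X]

# Step 2: keep only the columns whose term has degree exactly 2.
keep = [j for j, term in enumerate(terms) if len(term) == degree]
filtered = [[row[j] for j in keep] for row in full]
print(filtered)  # [[6.0, 10.0, 15.0], [4.0, 6.0, 24.0]]
```

With many features or a higher degree, the number of discarded columns grows quickly, which is the memory and compute cost mentioned above.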
I hope that my explanations make sense. Thank you for your time,
Issue Analytics
- State:
- Created 3 years ago
- Comments:11 (7 by maintainers)
Top GitHub Comments
To clarify the current behavior of interaction_only=True:
I find that this behavior is what I often want by default (to add interactions, for instance after a SplineTransform step in a pipeline) but I agree the name of the option is a bit confusing.
Maybe we could have specific options to:
and then deprecate the interaction_only option.
Instead of deprecating degree, could we just extend it to also accept a tuple, i.e. degree=(min_degree, max_degree)?