Factory-style construction of composite estimators
I find the API for specifying lists of constituent estimators in Pipeline, FeatureUnion, ColumnTransformer, StackingClassifier, VotingClassifier, GeneralizedNB, etc. ugly. I find the make_pipeline, make_column_transformer, etc. interfaces helpful, but the duplication of interfaces is actually quite ugly too, and it is limiting: the moment you want to name one of your Pipeline steps explicitly, in order to make your grid search parameter space reasonably intelligible, you need to change from make_pipeline syntax to Pipeline syntax. Yuck!
I’d like to see a POC implementation and likely a SLEP towards creating a syntax like:
est = (ColumnTransformer()
       .append(CountVectorizer(), 'body_text')
       .append(KNNImputer(), ['age'], name='impute')
       .fit(X))
The append method would progressively construct the transformers parameter of est. Note that the first use of append does not give the body_text transformer a name, so one will be automatically generated.
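To make the idea concrete, here is a minimal sketch of how such an append method might behave. It uses a hypothetical stand-in class (SketchColumnTransformer) and placeholder transformer classes rather than scikit-learn itself, and the automatic naming scheme (lowercased class name, numeric suffix on collision) is an assumption, not something the proposal specifies.

```python
class SketchColumnTransformer:
    """Hypothetical stand-in for ColumnTransformer; not scikit-learn code."""

    def __init__(self, transformers=None):
        # Stored as (name, transformer, columns) tuples, copied so the
        # caller's list is never modified.
        self.transformers = list(transformers or [])

    def _auto_name(self, transformer):
        # Assumed naming scheme: lowercased class name, with a numeric
        # suffix if that name is already taken.
        base = type(transformer).__name__.lower()
        taken = {name for name, _, _ in self.transformers}
        if base not in taken:
            return base
        n = 2
        while f"{base}-{n}" in taken:
            n += 1
        return f"{base}-{n}"

    def append(self, transformer, columns, name=None):
        if name is None:
            name = self._auto_name(transformer)
        elif any(name == existing for existing, _, _ in self.transformers):
            raise ValueError(f"duplicate transformer name: {name!r}")
        self.transformers = self.transformers + [(name, transformer, columns)]
        return self  # returning self is what allows the calls to chain


# Placeholder classes so the sketch runs without scikit-learn installed.
class CountVectorizer:
    pass


class KNNImputer:
    pass


est = (SketchColumnTransformer()
       .append(CountVectorizer(), 'body_text')
       .append(KNNImputer(), ['age'], name='impute'))

names = [name for name, _, _ in est.transformers]
print(names)
```

The first transformer receives a generated name and the second keeps the explicit one, which is the behaviour described above.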
Note that since we delay parameter validation until fit, similar syntax is already possible:
est = (LogisticRegression()
       .set_params(penalty='l1')
       .fit(X, y))
So additional methods like append are just syntactic sugar for getting, amending and setting the steps/transformers parameter. But I think they would encourage named parameters instead of confusing tuples, creating more legible code.
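The "syntactic sugar" point can be sketched as a small helper that does nothing but get, amend and set a list-valued parameter. The helper name append_item and the TinyPipeline class are hypothetical; the helper relies only on get_params/set_params, which scikit-learn estimators also expose, but it is shown here against a stand-in and is untested against real estimators.

```python
def append_item(estimator, param, item):
    """Get the list-valued parameter, amend a copy, and set it back."""
    current = estimator.get_params(deep=False)[param]
    estimator.set_params(**{param: list(current) + [item]})
    return estimator


class TinyPipeline:
    """Minimal stand-in exposing get_params/set_params like an estimator."""

    def __init__(self, steps=None):
        self.steps = list(steps or [])

    def get_params(self, deep=True):
        return {"steps": self.steps}

    def set_params(self, **params):
        for key, value in params.items():
            if key != "steps":
                raise ValueError(f"invalid parameter: {key!r}")
            self.steps = value
        return self

    def append(self, step, name):
        # Sugar only: delegates to the get/amend/set helper.
        return append_item(self, "steps", (name, step))


# Strings stand in for estimator objects to keep the sketch dependency-free.
pipe = (TinyPipeline()
        .append("scaler-object", name="scale")
        .append("classifier-object", name="clf"))

step_names = [name for name, _ in pipe.steps]
print(step_names)
```

Because the method takes name as a keyword, the call site never has to spell out a (name, estimator) tuple.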
Each of the above estimators would have methods similar to append or add (perhaps also insert).
PS: I have mentioned this proposal before, particularly when we ran into the inconsistent tuple ordering between make_column_transformer and ColumnTransformer.
Issue Analytics
- Created 4 years ago
- Reactions: 4
- Comments: 22 (22 by maintainers)
Top GitHub Comments
Maybe we should mention this at the next dev meeting. I don’t think many people saw this issue.
That’s a good point… I think that if we add a “better” API, we should change the examples/tutorial to use it. Where a method is added to control a parameter, we should note in the parameter docstring that the method is preferred. This specific proposal might involve deprecating make_pipeline etc., which are convenient but add some obscurity.