Factory-style construction of composite estimators

I find the API for specifying lists of constituent estimators in Pipeline, FeatureUnion, ColumnTransformer, StackingClassifier, VotingClassifier, GeneralizedNB, etc. ugly. The make_pipeline, make_column_transformer, etc. interfaces are helpful, but the duplication of interfaces is quite ugly too, and it is limiting: the moment you want to name one of your Pipeline steps explicitly, so that your grid search parameter space is reasonably intelligible, you need to switch from make_pipeline syntax to Pipeline syntax. Yuck!
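To make the complaint concrete, here is a minimal sketch of the two existing interfaces side by side. The choice of StandardScaler and LogisticRegression, the step names "scale" and "clf", and the grid values are assumptions picked only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline: concise, but step names are derived from the class names.
auto = make_pipeline(StandardScaler(), LogisticRegression())
auto_names = [name for name, _ in auto.steps]
# Grid-search keys then have to use the generated names.
auto_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

# Pipeline: explicit names, but every step becomes a (name, estimator) tuple.
named = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
named_names = [name for name, _ in named.steps]
named_grid = {"clf__C": [0.1, 1.0, 10.0]}

print(auto_names)
print(named_names)
```

Naming a single step means rewriting the whole construction in the second form, which is the switch described above.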

I’d like to see a POC implementation and likely a SLEP towards creating a syntax like:

est = (ColumnTransformer()
       .append(CountVectorizer(), 'body_text')
       .append(KNNImputer(), ['age'], name='impute')
       .fit(X))

The append method would progressively construct the transformers parameter of est. Note that the first use of append does not give the body_text transformer a name, so one would be generated automatically.

Note that since we delay parameter validation until fit, similar syntax is already possible:

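from sklearn.linear_model import LogisticRegression

est = (LogisticRegression()
       .set_params(penalty='l1', solver='liblinear')  # the default lbfgs solver does not support L1
       .fit(X, y))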

So additional methods like append are just syntactic sugar for getting, amending and setting the steps/transformers parameter. But I think they would encourage named parameters instead of confusing tuples, creating more legible code.
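As a rough sketch of what "getting, amending and setting" could look like today, the free function below emulates the proposed append for ColumnTransformer using only get_params and set_params. The function append_transformer is hypothetical and not part of scikit-learn, and the automatic naming scheme (lowercased class name, no handling of duplicates) is an assumption rather than anything the proposal specifies.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import KNNImputer


def append_transformer(ct, transformer, columns, name=None):
    """Hypothetical helper: get, amend and set the `transformers` parameter."""
    # Get: copy the current list so the original list object is not mutated.
    transformers = list(ct.get_params(deep=False)["transformers"])
    if name is None:
        # Assumed naming scheme: lowercased class name (duplicates not handled).
        name = type(transformer).__name__.lower()
    # Amend: ColumnTransformer expects (name, transformer, columns) tuples.
    transformers.append((name, transformer, columns))
    # Set: set_params returns the estimator, so calls can be chained.
    return ct.set_params(transformers=transformers)


# ColumnTransformer requires a transformers argument, so start from an empty list.
est = append_transformer(ColumnTransformer([]), CountVectorizer(), "body_text")
est = append_transformer(est, KNNImputer(), ["age"], name="impute")
print([name for name, _, _ in est.transformers])
```

Nothing is validated until fit, so building the list step by step like this does not touch any data. A method on the estimator itself would read more naturally than nested function calls, which is the point of the proposal.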

Each of the above estimators would have methods similar to append or add (perhaps also insert).

PS: I have mentioned this proposal before, particularly when we ran into the inconsistent tuple ordering between make_column_transformer and ColumnTransformer.

Issue Analytics

Created: 4 years ago
Reactions: 4
Comments: 22 (22 by maintainers)

Top GitHub Comments

rth commented, Jul 25, 2020 (2 reactions)

Maybe we should mention this at the next dev meeting. I don’t think many people saw this issue.

jnothman commented, Jul 29, 2021 (1 reaction)

“I find this interesting but I am concerned about supporting too many ways of achieving the same thing. I feel like this is what makes some libraries hard to use and learn.”

That’s a good point… I think that if we add a “better” API, we should change the examples/tutorial to use it. Where a method is added to control a parameter, we should note in the parameter docstring that the method is preferred. This specific proposal might involve deprecating make_pipeline etc which are convenient, but add some obscurity.
