Factory-style construction of composite estimators

See original GitHub issue

I find the API for specifying lists of constituent estimators in Pipeline, FeatureUnion, ColumnTransformer, StackingClassifier, VotingClassifier, GeneralizedNB, etc. ugly. I find the make_pipeline, make_column_transformer, etc. interfaces helpful, but the duplication of interfaces is quite ugly too, and it is limiting: the moment you want to name one of your Pipeline steps explicitly, in order to make your grid search parameter space reasonably intelligible, you need to change from make_pipeline syntax to Pipeline syntax. Yuck!
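To make the complaint concrete, here is a minimal sketch using the existing scikit-learn API. The step names "scale" and "clf" and the grid values are arbitrary choices made for illustration; they do not come from the issue.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline derives each step name from the lowercased class name,
# so the grid search keys end up long and tied to the class names.
auto = make_pipeline(StandardScaler(), LogisticRegression())
auto_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

# Choosing a name for even one step means switching to the
# list-of-tuples constructor and naming every step by hand.
named = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
named_grid = {"clf__C": [0.1, 1.0, 10.0]}

auto_names = [name for name, _ in auto.steps]
named_names = [name for name, _ in named.steps]
print(auto_names)
print(named_names)
```

Both grids address the same hyperparameter; only the prefix differs, and that prefix is what the choice of constructor dictates.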

I’d like to see a POC implementation and likely a SLEP towards creating a syntax like:

est = (ColumnTransformer()
       .append(CountVectorizer(), 'body_text')
       .append(KNNImputer(), ['age'], name='impute')
       .fit(X))

The append method would progressively construct the transformers parameter of est. Note that the first use of append does not give the body_text transformer a name, so one will be automatically generated.
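The behaviour described above can be approximated today with a free function. What follows is only a sketch: append_transformer is a hypothetical helper, not part of scikit-learn, and the auto-naming rule (lowercased class name, plus a numeric suffix on collision) is an assumption rather than anything the proposal specifies.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import KNNImputer


def append_transformer(ct, transformer, columns, name=None):
    """Hypothetical helper (not a scikit-learn API): add one entry to ct.transformers."""
    # Get a copy of the current value of the `transformers` parameter.
    current = list(ct.get_params(deep=False)["transformers"])
    taken = {existing for existing, _, _ in current}
    if name is None:
        # Assumed naming rule: lowercased class name, with a numeric
        # suffix only when that name is already in use.
        base = type(transformer).__name__.lower()
        name, suffix = base, 1
        while name in taken:
            suffix += 1
            name = f"{base}-{suffix}"
    elif name in taken:
        raise ValueError(f"duplicate transformer name: {name!r}")
    # Amend the copy, then set it back through the public parameter API.
    current.append((name, transformer, columns))
    ct.set_params(transformers=current)
    return ct  # returning the estimator is what would allow chaining


ct = ColumnTransformer(transformers=[])
append_transformer(ct, CountVectorizer(), "body_text")
append_transformer(ct, KNNImputer(), ["age"], name="impute")
ct_names = [name for name, _, _ in ct.transformers]
print(ct_names)
```

The call to fit is left out because it would need a DataFrame with body_text and age columns. Constructing the estimator with an empty transformers list works here only because validation is deferred until fit.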

Note that since we delay parameter validation until fit, similar syntax is already possible:

est = (LogisticRegression()
       # the default lbfgs solver does not support an l1 penalty
       .set_params(penalty='l1', solver='liblinear')
       .fit(X, y))
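A small sketch of what delayed validation means in practice, using only existing scikit-learn calls. The toy data and the values of C are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny, linearly separable toy data set (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# set_params returns the estimator itself, so calls can be chained.
est = LogisticRegression().set_params(C=0.5).fit(X, y)

# An out-of-range value is accepted at this point without complaint...
bad = LogisticRegression().set_params(C=-1.0)

# ...and is rejected only when fit runs its parameter validation.
try:
    bad.fit(X, y)
    raised = False
except ValueError:
    raised = True
print(raised)
```

set_params checks only that the parameter name exists, which is why an estimator can be built up incrementally before any value is validated.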

So additional methods like append are just syntactic sugar for getting, amending and setting the steps/transformers parameter. But I think they would encourage named parameters instead of confusing tuples, creating more legible code.

Each of the above estimators would have methods similar to append or add (perhaps also insert).
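As a rough illustration of what an insert counterpart could look like for Pipeline, here is a sketch built on the existing steps parameter. insert_step is a hypothetical helper, not part of scikit-learn, and its signature is an assumption.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def insert_step(pipe, index, estimator, name):
    """Hypothetical helper (not a scikit-learn API): insert a named step at `index`."""
    steps = list(pipe.steps)  # get a copy of the current parameter
    if name in {existing for existing, _ in steps}:
        raise ValueError(f"duplicate step name: {name!r}")
    steps.insert(index, (name, estimator))  # amend the copy
    return pipe.set_params(steps=steps)  # set it back; returns the pipeline


pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
insert_step(pipe, 1, PCA(n_components=2), name="reduce")
step_names = [name for name, _ in pipe.steps]
print(step_names)
```

Because the new step is named by keyword, its parameters are addressable in a grid search as reduce__n_components without rewriting the rest of the pipeline definition.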

PS: I have mentioned this proposal before, particularly when we ran into the inconsistent tuple ordering between make_column_transformer and ColumnTransformer.

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 4
  • Comments: 22 (22 by maintainers)

Top GitHub Comments

2 reactions
rth commented, Jul 25, 2020

Maybe we should mention this at the next dev meeting. I don’t think many people saw this issue.

1 reaction
jnothman commented, Jul 29, 2021

> I find this interesting but I am concerned about supporting too many ways of achieving the same thing. I feel like this is what makes some libraries hard to use and learn.

That’s a good point… I think that if we add a “better” API, we should change the examples/tutorial to use it. Where a method is added to control a parameter, we should note in the parameter docstring that the method is preferred. This specific proposal might involve deprecating make_pipeline etc., which are convenient but add some obscurity.
