
FYI/RFC: MLR Pipeline infrastructure and fit.transform vs fit_transform

See original GitHub issue

I recently talked with the authors of the MLR package for R. They just changed their pipeline infrastructure to something quite similar to ours, though they developed it independently. See https://mlr3pipelines.mlr-org.com for details.

I wanted to point out one particular aspect of their design. We have been having issues with fit().transform() vs fit_transform, for example in stacking and resampling. They completely avoid the issue by having fit produce a representation of the dataset. You could say they don’t have fit, they only have fit_transform, though it’s basically just called fit. That makes it very obvious that there are two separate transformations implemented by each model: the one on the training set, and the one on the test set, and no confusion can arise. The method names are very different, so there is no expectation that the two would ever produce the same result.
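The stacking/resampling problem described above can be made concrete with a toy target-mean encoder. This is a hypothetical sketch (the class and method names are illustrative, not an existing scikit-learn API): on the training set, each row must be encoded without its own target value to avoid leakage, so fit_transform(X, y) deliberately produces a different result than fit(X, y) followed by transform(X).

```python
from collections import defaultdict

class MeanEncoder:
    """Illustrative encoder where the train-time and test-time
    transformations legitimately differ.

    fit_transform: leave-one-out per-category target mean (train set)
    transform:     plain per-category target mean (test set)
    """

    def fit_transform(self, cats, y):
        sums, counts = defaultdict(float), defaultdict(int)
        for c, t in zip(cats, y):
            sums[c] += t
            counts[c] += 1
        # state used later by transform()
        self.means_ = {c: sums[c] / counts[c] for c in sums}
        out = []
        for c, t in zip(cats, y):
            if counts[c] > 1:
                # exclude the row's own target: no leakage into training features
                out.append((sums[c] - t) / (counts[c] - 1))
            else:
                out.append(sums[c] / counts[c])
        return out

    def transform(self, cats):
        # test-time transformation: the fitted per-category means
        return [self.means_[c] for c in cats]
```

Here fit_transform(X, y) followed by transform(X) on the same rows gives different outputs by design, which is exactly the situation where a single verb for "produce the training-set representation" removes the ambiguity.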

I’m not sure this is a route we want to consider, but it would remove a lot of pain points we’re currently seeing, and it seems like a cleaner design to me.

If we wanted to do something like that in sklearn, we would either need to stop returning self from fit, which might be too much of a break in the API, even for a 1.0 release. The other option would be to come up with another verb and have fit_verb, which we kind of have with fit_resample. Though in principle it could be fit_transmogrify and do arbitrary things (like stacking).
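One way to picture the fit_verb idea is a pipeline driver that dispatches on whichever train-time verb a step provides. This is a minimal hypothetical sketch, not the scikit-learn Pipeline implementation; the step classes are invented for illustration.

```python
# Hypothetical pipeline driver: each step exposes one train-time verb.
# fit_transform preserves the number of rows; fit_resample may not.

def fit_pipeline(steps, X, y):
    """Run each step's train-time verb in order; the last step may be
    a plain estimator with only fit()."""
    for step in steps:
        if hasattr(step, "fit_resample"):
            X, y = step.fit_resample(X, y)   # may add or remove rows
        elif hasattr(step, "fit_transform"):
            X = step.fit_transform(X, y)     # rows preserved
        else:
            step.fit(X, y)                   # final estimator
    return X, y

class DropEvens:
    """Toy resampler: removes rows whose feature is even."""
    def fit_resample(self, X, y):
        kept = [i for i, x in enumerate(X) if x % 2 != 0]
        return [X[i] for i in kept], [y[i] for i in kept]

class AddOne:
    """Toy transformer: shifts each feature by one."""
    def fit_transform(self, X, y=None):
        return [x + 1 for x in X]
```

Because the verbs have different names and signatures, a step that changes the number of samples can never be confused with one that merely transforms them.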

I found it quite fascinating that months of our discussions would just resolve if we hadn’t decided to have fit return self.

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 2
  • Comments: 5 (5 by maintainers)
Top GitHub Comments

1 reaction
jnothman commented, Nov 10, 2019

But it would be rare to use KNNImputer and not want the training data imputed, so it’s okay that it’s expensive.
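The point about KNNImputer's cost can be illustrated with a stripped-down 1-nearest-neighbor imputer. This is a simplified sketch, not the scikit-learn KNNImputer implementation: fit is cheap because it only stores the complete rows, while transform carries the expensive neighbor search, so imputing the training data via transform costs the same either way.

```python
import math

class OneNNImputer:
    """Simplified 1-NN imputer: fills missing values (None) from the
    nearest fully observed training row, measured on observed columns."""

    def fit(self, X):
        # cheap: just remember the fully observed rows
        self.complete_ = [r for r in X if None not in r]
        return self

    def transform(self, X):
        out = []
        for row in X:
            if None not in row:
                out.append(list(row))
                continue
            obs = [j for j, v in enumerate(row) if v is not None]
            # expensive part: distance to every stored training row
            nearest = min(
                self.complete_,
                key=lambda r: math.dist([row[j] for j in obs],
                                        [r[j] for j in obs]),
            )
            out.append([nearest[j] if v is None else v
                        for j, v in enumerate(row)])
        return out
```

Since nearly every user wants the training data imputed too, paying the transform cost on the training set is the common case rather than an avoidable one.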

There certainly seem to be nice things about this design.

0 reactions
adrinjalali commented, Aug 22, 2021

Moving to 2.0.
