FYI/RFC: MLR Pipeline infrastructure and fit.transform vs fit_transform
I recently talked with the authors of the MLR package for R. They just changed their pipeline infrastructure to something quite similar to ours, though they developed it independently. See this for details: https://mlr3pipelines.mlr-org.com
I wanted to point out one particular aspect of their design.
We have been having issues with `fit().transform()` vs `fit_transform`, for example in stacking and resampling.
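To make the pain point concrete: for a stacking feature, the training-set and test-set transformations legitimately differ, so `fit(X).transform(X)` and `fit_transform(X)` give different results. Below is a minimal sketch of such a transformer; the class name `StackingFeature` and its details are illustrative, not sklearn API.

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.model_selection import cross_val_predict


class StackingFeature(BaseEstimator, TransformerMixin):
    """Hypothetical stacking transformer whose training-set and
    test-set outputs legitimately differ."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # Test-set path: predictions of the fully fitted base model.
        return self.estimator_.predict_proba(X)

    def fit_transform(self, X, y):
        # Training-set path: out-of-fold predictions, so the stacked
        # features are not contaminated by the labels of the same rows.
        self.fit(X, y)
        return cross_val_predict(self.estimator, X, y,
                                 method="predict_proba")
```

Here `fit_transform(X, y)` is not a shortcut for `fit(X, y).transform(X)`; the two are different by design, which is exactly what the current API contract does not express.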
They completely avoid the issue by having `fit` produce a representation of the dataset. You could say they don't have `fit`, they only have `fit_transform`, though it's basically just called `fit`.
That makes it very obvious that there are two separate transformations implemented by each model: one for the training set and one for the test set, so no confusion can arise. The method names are very different, so there is no expectation that the two would ever produce the same result.
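In Python terms, that design might look like the sketch below. The method names are chosen to mirror mlr3pipelines (which uses `$train()` and `$predict()` on its pipeline operators); this is an assumed translation, not an existing sklearn interface.

```python
import numpy as np


class MeanCenterer:
    """Sketch of an mlr3pipelines-style pipeline node. The training
    method both fits the state and returns the transformed training
    data, so the only training-set operation *is* fit_transform."""

    def train(self, X):
        # Fit the state *and* return the transformed training data.
        self.mean_ = X.mean(axis=0)
        return X - self.mean_

    def predict(self, X):
        # Clearly distinct method for the test-set transformation.
        return X - self.mean_
```

Because `train` returns data rather than `self`, there is simply no way to write the ambiguous `fit(X).transform(X)` in this style.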
I’m not sure this is a route we want to consider, but it would remove a lot of pain points we’re currently seeing, and it seems like a cleaner design to me.
If we wanted to do something like that in sklearn, we would either need to stop returning `self` in `fit`, which might be too much of a break in the API, even for a 1.0 release. The other option would be to come up with another verb and have `fit_verb`, which we kind of have for `fit_resample`. Though in principle it could be `fit_transmogrify` and do arbitrary things (like stacking).
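The `fit_resample` verb already shows why this pattern sidesteps the problem. A toy sketch (the class is made up; the method name is modeled on imbalanced-learn's API):

```python
import numpy as np


class MajorityUnderSampler:
    """Toy fit_resample sketch. Like fit_transform, fit_resample only
    ever makes sense on training data -- there is no test-set
    counterpart, so no fit().resample() ambiguity can arise."""

    def fit_resample(self, X, y):
        # Keep as many rows of each class as the rarest class has.
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate(
            [np.flatnonzero(y == c)[:n_min] for c in classes]
        )
        return X[keep], y[keep]
```

Since the verb is defined only as a combined fit-and-produce-data operation, there is nothing sensible for it to do at test time, and the API makes that explicit.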
I found it quite fascinating that months of our discussions would just resolve if we hadn't decided to have `fit` return `self`…
Issue Analytics
- Created 4 years ago
- Reactions: 2
- Comments: 5 (5 by maintainers)
But it would be rare to use `KNNImputer` and not want the training data imputed, so it’s okay that it’s expensive.
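For reference, the common usage that comment describes, with scikit-learn's actual `KNNImputer`:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Training matrix with a missing entry.
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [7.0, 8.0]])

# fit_transform fits the imputer on X and returns X with the missing
# value filled in -- you almost always want the training data imputed
# too, not just future test data.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```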
There certainly seem to be nice things about this design.
Moving to 2.0.