Issues encountered with the `memory` option in `Pipeline`
I'm very interested in using the `memory` option in `Pipeline` so I can organise my code in a simple fashion. However, I've found that it does not scale well, and there are some caveats one might not expect:
- If the input training data is too big (~several GB), `joblib.Memory` takes a very long time to hash it, which can considerably slow the execution in an unexpected way (see the sketch after this list).
- The documentation hints that "Caching the transformers is advantageous when fitting is time consuming." However, `fit_transform` is cached, so not only the fitted transformer but also the transformed training data seems to be cached? Then it is also advantageous when transforming is time consuming, but this can quickly add up to a considerable amount of space on the hard drive.
- Finally, if the code of a transformer changes, but neither its methods nor its attributes do, it seems to me that the hash will not change (because the code of `_fit_transform_one` does not). That's something the user could be warned about (the cache needs to be wiped when the code of a currently cached transformer has been altered); otherwise one can load the previous version of a transformer from the cache by mistake.
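For context, here is a minimal sketch of the setup under discussion. The slow hashing shows up when `X` is large, and `Memory.clear()` is one way to wipe a stale cache; the cache path and data shapes are assumptions for illustration:

```python
import numpy as np
from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Cache fitted transformers on disk; "./cache" is an arbitrary path.
memory = Memory(location="./cache", verbose=0)

pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("clf", LogisticRegression()),
    ],
    memory=memory,  # each transformer's fit_transform result is cached
)

# With a large X (e.g. several GB), joblib must hash the whole array
# before it can look up the cache entry, which is where the unexpected
# slowdown described above comes from.
X = np.random.rand(1000, 20)  # small here; imagine (10_000_000, 50)
y = np.random.randint(0, 2, size=1000)
pipe.fit(X, y)

# If a transformer's code changed since the cache was written, a lookup
# may silently return the old result; wiping the cache is the safe fix:
memory.clear(warn=False)
```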
Top GitHub Comments
Maybe separate issues should be opened for each of those points? The issue with joblib loading results from a previously cached, deprecated version of a transformer sounds important too, as long as it's not documented.
I for one ended up writing a meta-estimator `CachedTransformer` that lets me choose whether to cache `fit`, `transform`, or both (and, if both, `fit_transform` too). The code is heavier this way, but I rarely need to cache more than 1 or 2 transformers in a pipeline, so I found it acceptable.

Both options sound very good.
You're only likely to see a benefit on something with a slow `fit`, a slow `transform`, or both. `StandardScaler` is about as cheap as it gets. Try it with a `CountVectorizer`.
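For instance, a quick way to see the difference; the corpus here is a stand-in assumption, and any real text collection of similar size would do:

```python
from tempfile import mkdtemp
from time import perf_counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

corpus = ["the quick brown fox"] * 50_000  # stand-in for a real corpus
labels = [0, 1] * 25_000

pipe = Pipeline(
    steps=[("vec", CountVectorizer()), ("clf", MultinomialNB())],
    memory=mkdtemp(),  # a string path also works as the cache directory
)

for run in (1, 2):
    start = perf_counter()
    pipe.fit(corpus, labels)
    # The second fit reuses the cached fit_transform of CountVectorizer,
    # so only the final estimator is refit.
    print(f"run {run}: {perf_counter() - start:.2f}s")
```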