Issues encountered with the `memory` option in `Pipeline`
I'm very interested in using the `memory` option in `Pipeline` so I can organise my code in a simple fashion. However, I've found that it does not scale well, and there are some caveats one might not expect:
- If the input training data is too big (~several GB), `joblib.Memory` takes a very long time to hash it, which can considerably slow the execution in an unexpected way (see the sketch after this list).
- The documentation hints that "Caching the transformers is advantageous when fitting is time consuming." However, `fit_transform` is cached, so not only the fitted transformer but also the transformed training data seems to be cached? Then it is also advantageous when transforming is time consuming, but this can quickly add up to a considerable amount of space on the hard drive.
- Finally, if the code of a transformer changes, but neither its methods nor its attributes do, it seems to me that the hash will not change (because the code of `_fit_transform_one` does not). That's something the user could be warned about (the cache needs to be wiped when the code of a currently cached transformer has been altered); otherwise one can load the previous version of a transformer from the cache by mistake.
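For context, here is a minimal sketch of the setup under discussion. The slow hashing shows up when `X` is large, and `Memory.clear()` is one way to wipe a stale cache; the cache path and data shapes are assumptions for illustration:

```python
import numpy as np
from joblib import Memory
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Cache fitted transformers on disk; "./cache" is an arbitrary path.
memory = Memory(location="./cache", verbose=0)

pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("clf", LogisticRegression()),
    ],
    memory=memory,  # each transformer's fit_transform result is cached
)

# With a large X (e.g. several GB), joblib must hash the whole array
# before it can look up the cache entry, which is where the unexpected
# slowdown described above comes from.
X = np.random.rand(1000, 20)  # small here; imagine (10_000_000, 50)
y = np.random.randint(0, 2, size=1000)
pipe.fit(X, y)

# If a transformer's code changed since the cache was written, a lookup
# may silently return the old result; wiping the cache is the safe fix:
memory.clear(warn=False)
```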
Top GitHub Comments
Maybe separate issues should be opened for each of those points? The issue with joblib loading results from a previously cached, deprecated version of a transformer sounds important too, as long as it's not documented.
I for one ended up writing a meta-estimator `CachedTransformer` that lets me choose whether to cache `fit`, `transform`, or both (and, if both, `fit_transform` too). The code is heavier this way, but I rarely need to cache more than 1 or 2 transformers in a pipeline, so I found it acceptable.

Both options sound very good.
You're only likely to see a benefit on something with a slow `fit`, a slow `transform`, or both. `StandardScaler` is about as cheap as it gets. Try it with a `CountVectorizer`.
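For instance, a quick way to see the difference; the corpus here is a stand-in assumption, and any real text collection of similar size would do:

```python
from tempfile import mkdtemp
from time import perf_counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

corpus = ["the quick brown fox"] * 50_000  # stand-in for a real corpus
labels = [0, 1] * 25_000

pipe = Pipeline(
    steps=[("vec", CountVectorizer()), ("clf", MultinomialNB())],
    memory=mkdtemp(),  # a string path also works as the cache directory
)

for run in (1, 2):
    start = perf_counter()
    pipe.fit(corpus, labels)
    # The second fit reuses the cached fit_transform of CountVectorizer,
    # so only the final estimator is refit.
    print(f"run {run}: {perf_counter() - start:.2f}s")
```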