
Issues encountered with the memory option in Pipeline

See original GitHub issue

I’m very interested in using the memory option in Pipeline so I can organise my code in a simple fashion. However, I’ve found that it does not scale well, and there are some caveats one might not expect:

  • if the input training data is too big (~several GB), joblib.Memory takes a very long time to hash it, which can considerably slow down execution in an unexpected way.

  • the documentation hints that “Caching the transformers is advantageous when fitting is time consuming.” However, it is fit_transform that is cached, so not only the fitted transformer but also the transformed training data appears to be cached. It is therefore also advantageous when transforming is time consuming, but this can quickly add up to a considerable amount of space taken on the hard drive.

  • finally, if the code of a transformer changes, but neither its methods nor its attributes do, it seems to me that the hash will not change (because the code of _fit_transform_one does not). The user could be warned about this (the cache needs to be wiped when the code of a currently cached transformer has been altered); otherwise, it is possible to load the previous version of a transformer from the cache by mistake.
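The behaviour described in these points can be reproduced with a minimal sketch, assuming scikit-learn and joblib are installed. StandardScaler is used purely for illustration; the caching mechanism hashes the step and its inputs on every fit call, which is the source of the first point's cost:

```python
from tempfile import mkdtemp
from shutil import rmtree

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

cachedir = mkdtemp()

X = np.random.RandomState(0).rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)

# With memory set, each intermediate step's fit_transform result is cached
# on disk, keyed by a joblib hash of the transformer and the input data.
pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression())],
    memory=cachedir,
)
pipe.fit(X, y)

# A second fit with identical data hits the cache instead of refitting the
# transformer; the cost of hashing X, however, is paid on every call.
pipe.fit(X, y)

rmtree(cachedir)
```

Note that both the fitted transformer and the transformed training data are stored under `cachedir`, which is what makes the disk-space concern in the second point real for large inputs.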

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 1
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

2 reactions
fcharras commented, Nov 6, 2017

Maybe separate issues should be opened for each of those points? The issue with joblib loading results from a previously cached, deprecated version of a transformer also sounds important, as long as it’s not documented.

When building a Pipeline, is it too tedious to pass a Memory object for each transformer?

I, for one, ended up writing a meta-estimator CachedTransformer that lets me choose whether to cache the fit, the transform, or both (and, if both, fit_transform too). The code is heavier this way, but I rarely need to cache more than one or two transformers in a pipeline, so I found it acceptable.
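The commenter’s actual implementation is not shown in the thread; a minimal sketch of such a meta-estimator (the name CachedTransformer and the per-method flags are taken from the comment, everything else is an assumption) might look like:

```python
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin, clone


def _fit_one(transformer, X, y):
    # Module-level function so joblib.Memory can hash and cache it reliably.
    return clone(transformer).fit(X, y)


def _transform_one(transformer, X):
    return transformer.transform(X)


class CachedTransformer(BaseEstimator, TransformerMixin):
    """Wrap a transformer and optionally cache its fit and/or transform.

    Sketch only: caching is delegated to a joblib.Memory instance passed
    explicitly, instead of relying on Pipeline's memory option.
    """

    def __init__(self, transformer, memory, cache_fit=True, cache_transform=True):
        self.transformer = transformer
        self.memory = memory
        self.cache_fit = cache_fit
        self.cache_transform = cache_transform

    def fit(self, X, y=None):
        fit_func = self.memory.cache(_fit_one) if self.cache_fit else _fit_one
        self.transformer_ = fit_func(self.transformer, X, y)
        return self

    def transform(self, X):
        tr = self.memory.cache(_transform_one) if self.cache_transform else _transform_one
        return tr(self.transformer_, X)
```

With this design only the one or two expensive steps of a pipeline need to be wrapped, while the rest run uncached.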

hashing the input once (it could still take some time) or a subset of the input data

Both options sound very good.

0 reactions
jnothman commented, Nov 29, 2017

You’re only likely to get benefits on something with a slow fit or transform (or both). StandardScaler is about as cheap as it gets. Try it with a CountVectorizer.

Read more comments on GitHub >

Top Results From Across the Web

Running a pipeline produces Out of Memory error
Thanks for reporting the issue on Developer Community. The hosted machines guarantee a minimum amount of free memory. If your build requirements exceed...
Read more >
Troubleshoot Dataflow out of memory errors - Google Cloud
This page provides information about memory usage in Dataflow pipelines and steps for investigating and resolving issues with Dataflow out of memory (OOM) ......
Read more >
Troubleshoot pipeline runs - Azure DevOps - Microsoft Learn
When you run pipelines on multiple platforms, you can sometimes encounter problems with different line endings. Historically, Linux and macOS ...
Read more >
Improving pipeline performance: Process memory allocation ...
My first thought was that one of the tasks was stuck at some Gradle process or it was having some memory issue so...
Read more >
[Bug]: Memory consumption issues on Node JS 16.11.0+ ...
We had some issues with Jest workers consuming all available RAM both on CI machine and locally. After doing some research, we found...
Read more >
