pickle cache of submitted functions generates object collision
TLDR: distributed.worker uses a pickle cache when loading submit’ed functions. This cache is hit even for impure, stateful callables whose execution will have side effects on themselves.
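Conceptually, the mechanism behaves something like the following (a simplified sketch for illustration, not distributed’s actual implementation): deserialized callables are keyed on their pickle bytes, so byte-identical payloads come back as the same live object.

import pickle

_cache = {}

def cache_loads(payload: bytes):
    # simplified stand-in for the worker's loads cache: byte-identical
    # payloads are deserialized once and return the same live object
    if payload not in _cache:
        _cache[payload] = pickle.loads(payload)
    return _cache[payload]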
Here is a simple example illustrating this:
import threading

from distributed import Client, LocalCluster


class Func:
    def __init__(self):
        self._thread_idents = []
        # Uncomment this line and the test passes (reason: cache_loads)
        # self._data = b'\x00' * int(1e5)

    def __call__(self):
        self._thread_idents.append(threading.get_ident())  # side effect  # noqa
        return self


if __name__ == "__main__":
    # note how each f in fs is a different object.
    fs = [Func() for _ in range(10)]
    cluster = LocalCluster(n_workers=1, threads_per_worker=4)
    client = Client(cluster)
    # setting pure=False because of the anticipated side effect of f.__call__
    futures = [client.submit(f, pure=False) for f in fs]
    results = [f.result() for f in futures]
    # r._thread_idents should have been appended to exactly once for each r
    # in results.
    assert all(len(r._thread_idents) == 1 for r in results)
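The collision happens because two distinct instances with identical state pickle to identical bytes, so a byte-keyed cache cannot tell them apart; the commented-out _data line makes the payload large enough to skip the cache, which is the “arbitrary size check” discussed in the comments below. A minimal illustration (plain pickle, no cluster needed):

import pickle

class Func:
    def __init__(self):
        self._thread_idents = []

f1, f2 = Func(), Func()
assert f1 is not f2                          # distinct objects...
assert pickle.dumps(f1) == pickle.dumps(f2)  # ...but identical pickle bytes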
Typical real-life use case
Using scikit-learn: calling [client.submit(clone(estimator).fit, X, y) for _ in range(5)].
Because each submitted task has the exact same pickle representation, distributed will use its cache and treat all 5 estimators as one (provided there is only one distributed worker).
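Spelled out, the pattern looks like the following (a sketch with illustrative data and a single-worker LocalCluster; the estimator choice is an assumption):

import numpy as np
from distributed import Client, LocalCluster
from sklearn.base import clone
from sklearn.linear_model import SGDClassifier

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
estimator = SGDClassifier()

client = Client(LocalCluster(n_workers=1))
# each clone(estimator) is a distinct object, but all five pickle to the
# same bytes, so the worker's cache may hand every task the same object
futures = [client.submit(clone(estimator).fit, X, y, pure=False) for _ in range(5)]
models = [f.result() for f in futures]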
Do you consider that to be a bug? I would have guessed that using pure=False in client.submit would solve this, but it does not.
Top GitHub Comments
The pure=False keyword is poorly named, and in general a decision that we (I) regret. It should really be deterministic=False. Pure means too many things to be a good keyword name.

Bluntness is very welcome 😃
You’re right that we could relax the assumption that functions don’t mutate inputs, and your desired workflow would probably work well today. The challenge here is that there are some situations beyond your control where it wouldn’t work. Consider the following example:
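(A minimal sketch of the kind of incremental-training loop being described, assuming a partial_fit helper that mutates the model in place; all names and data are illustrative.)

import numpy as np
from distributed import Client, LocalCluster
from sklearn.linear_model import SGDClassifier

def partial_fit(model, X, y):
    model.partial_fit(X, y, classes=np.array([0, 1]))  # mutates model in place
    return model

client = Client(LocalCluster(n_workers=1))
batches = [(np.random.rand(50, 3), np.random.randint(0, 2, 50)) for _ in range(10)]

est = SGDClassifier()
future = est
for X_batch, y_batch in batches:
    # each task feeds the previous result back in and mutates it in place
    future = client.submit(partial_fit, future, X_batch, y_batch)
fitted = future.result()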
But now imagine that partway through training something happens and we have to start over. Maybe we still have a pointer to the original est, but we’re not sure how many partial_fit calls we’ve done so far. Today, because Dask assumes that functions don’t modify inputs, it can replay the computation from wherever things broke. If we don’t make that assumption then we don’t know where to start. Resilience and in-place are hard to do at the same time.

Or alternatively …
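(Again a sketch, reusing the partial_fit helper and client from above, with made-up batches a and b:)

X_a, y_a = np.random.rand(50, 3), np.random.randint(0, 2, 50)
X_b, y_b = np.random.rand(50, 3), np.random.randint(0, 2, 50)

def predict(model, X):
    return model.predict(X)

est_a = client.submit(partial_fit, SGDClassifier(), X_a, y_a)
predictions_a = client.submit(predict, est_a, X_a)   # should reflect batch a only
est_b = client.submit(partial_fit, est_a, X_b, y_b)  # mutates the same object in place
# the scheduler sees no dependency from predictions_a to est_b, so if est_b
# runs first, predictions_a is computed on a model already fitted on batch b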
Because we’re operating in place, there is now an implicit dependency that predictions_a needs to run before b. Otherwise we’re performing the predictions on a with the additional fitted data of b. Parallelism and in-place are hard to do safely without more information from the user.

So we make the assumption that user tasks don’t modify inputs. You’re right that the arbitrary size check is odd in general. However, it’s safe if we make the assumption that user code doesn’t rely on side effects or in-place operation. In general we’ve chosen to make that assumption, because we have to anyway for things to work.