
pickle cache of submitted functions generates object collisions


TLDR: distributed.worker uses a pickle cache when loading submitted functions. This cache is hit even for impure, stateful callables whose execution has side effects on themselves.

Here is a simple example illustrating this:

import threading

from distributed import Client, LocalCluster


class Func:
    def __init__(self):
        self._thread_idents = []

        # Uncomment this line and the test passes (reason: cache_loads)
        # self._data = b'\x00' * int(1e5)

    def __call__(self):
        self._thread_idents.append(threading.get_ident())  # side effect  # noqa
        return self


if __name__ == "__main__":
    # note how each f in fs is a different object.
    fs = [Func() for _ in range(10)]

    cluster = LocalCluster(n_workers=1, threads_per_worker=4)
    client = Client(cluster)

    # setting pure=False because of anticipated side effect of f.__call__
    futures = [client.submit(f, pure=False) for f in fs]
    results = [f.result() for f in futures]

    # each r in results should have had its _thread_idents appended to
    # exactly once.
    assert all(len(r._thread_idents) == 1 for r in results)
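The collision can also be reproduced without a cluster, using a sketch of a pickle-keyed cache (stdlib only; an approximation of what the worker-side function cache does, not distributed's actual code):

```python
import pickle

_cache = {}  # maps pickled bytes -> deserialized object, like the worker's function cache


def cached_loads(data: bytes):
    # Sketch of a pickle-keyed cache: identical bytes yield the *same* object.
    if data not in _cache:
        _cache[data] = pickle.loads(data)
    return _cache[data]


class Func:
    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        return self


# Two distinct objects with identical pickle representations...
f1, f2 = Func(), Func()
assert pickle.dumps(f1) == pickle.dumps(f2)

# ...collide in the cache: both deserialize to the same object, so side
# effects from one "task" leak into the other.
g1 = cached_loads(pickle.dumps(f1))
g2 = cached_loads(pickle.dumps(f2))
assert g1 is g2
g1()
g2()
assert g1.calls == 2  # one shared object ran twice
```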

Typical real life use-case

Using scikit-learn: calling [client.submit(clone(estimator).fit, X, y) for _ in range(5)]. Because each submitted task has exactly the same pickle representation, distributed will hit its cache and treat all 5 estimators as one (provided there is only one distributed worker).

Do you consider this to be a bug? I would have guessed that using pure=False in client.submit would solve it, but it does not.

cc’ing @ogrisel @tomMoral. Cheers!

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
mrocklin commented, Apr 26, 2020

The pure=False keyword is poorly named and, in general, a decision that we (I) regret. It should really be deterministic=False. "Pure" means too many things to be a good keyword name.
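The distinction can be sketched as follows (an approximation: Dask's real keys come from dask.base.tokenize, and pure=False mixes in randomness; the names here are hypothetical):

```python
import hashlib
import pickle
import uuid


def make_key(func_name: str, args: tuple, pure: bool = True) -> str:
    # Sketch of task-key generation. With pure=True the key is a
    # deterministic hash of the call, so identical calls dedupe to one task.
    token = hashlib.sha256(pickle.dumps((func_name, args))).hexdigest()[:16]
    if not pure:
        # pure=False ("deterministic=False" would be the better name):
        # mix in randomness so each submit gets a fresh task.
        token = uuid.uuid4().hex
    return f"{func_name}-{token}"


k1 = make_key("fit", (1, 2), pure=True)
k2 = make_key("fit", (1, 2), pure=True)
assert k1 == k2  # pure calls collapse into a single task

k3 = make_key("fit", (1, 2), pure=False)
k4 = make_key("fit", (1, 2), pure=False)
assert k3 != k4  # impure calls stay distinct tasks

# But distinct task *keys* do not help with the issue above: the
# worker-side deserialization cache is keyed on the pickled bytes of the
# function, which are identical regardless of pure=False.
```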

0 reactions
mrocklin commented, Apr 26, 2020

Pardon my bluntness, but I have to admit that at first sight, the choice to use a cache (switched on or off with a hardcoded bytes threshold in loads_function) looks like an implementation detail to me, and should not affect the behavior of the scheduler/worker.

Bluntness is very welcome 😃

You’re right that we could relax the assumption that functions don’t mutate inputs and your desired workflow would probably work well today. The challenge here is that there are some situations beyond your control where it wouldn’t work. Consider the following example:

est = client.submit(MyEstimator, *args)
for X, y in batches:
    est = client.submit(MyEstimator.partial_fit, est, X, y)
predictions = client.submit(MyEstimator.predict, est, data)

But now imagine that partway through training something happens and we have to start over. Maybe we still have a pointer to the original est, but we’re not sure how many partial_fit calls we’ve done so far. Today, because Dask assumes that functions don’t modify inputs, it can replay the computation from wherever things broke. If we don’t make that assumption then we don’t know where to start. Resilience and in-place are hard to do at the same time.
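The resilience argument can be made concrete with a toy replay (hypothetical code simulating tasks being re-run from their recorded inputs after a failure):

```python
import copy


class Model:
    def __init__(self):
        self.seen = []


def partial_fit_pure(model, batch):
    # Functional style: return a new model, leave the input untouched.
    out = copy.deepcopy(model)
    out.seen.append(batch)
    return out


def partial_fit_inplace(model, batch):
    # In-place style: mutate the input and return it.
    model.seen.append(batch)
    return model


# Pure tasks replay safely: re-running from the original input reproduces
# the same state, no matter how often we retry.
m0 = Model()
first = partial_fit_pure(m0, "X1")
retry = partial_fit_pure(m0, "X1")  # simulated worker crash -> replay
assert first.seen == retry.seen == ["X1"]

# In-place tasks do not: the input already carries the first run's side
# effect, so replaying double-counts the batch and we no longer know
# where to restart from.
m1 = Model()
partial_fit_inplace(m1, "X1")
partial_fit_inplace(m1, "X1")  # replay after a crash
assert m1.seen == ["X1", "X1"]
```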

Or alternatively …

est = client.submit(MyEstimator, *args)

a = client.submit(MyEstimator.partial_fit, est, X1, y1)
predictions_a = client.submit(MyEstimator.predict, a, data_a)

b = client.submit(MyEstimator.partial_fit, est, X2, y2)
predictions_b = client.submit(MyEstimator.predict, b, data_b)

Because we’re operating in-place there is now an implicit dependency that predictions_a needs to run before b. Otherwise we’re performing the predictions on a with the additional fitted data of b. Parallelism and in-place are hard to do safely without more information from the user.
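A sequential sketch of that hazard (hypothetical names; no scheduler involved, just the in-place mutation standing in for an unlucky execution order):

```python
class Estimator:
    def __init__(self):
        self.batches = []

    def partial_fit(self, batch):
        self.batches.append(batch)  # in-place update
        return self

    def predict(self):
        return list(self.batches)


est = Estimator()
a = est.partial_fit("X1")  # a is the same object as est
b = est.partial_fit("X2")  # ...and so is b

# Simulating the scheduler running b's fit before a's predict:
# predictions_a now sees X2 as well, an implicit ordering dependency
# that nothing in the task graph expresses.
predictions_a = a.predict()
assert predictions_a == ["X1", "X2"]  # contaminated by b's batch
```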

So we make the assumption that user tasks don’t modify inputs. You’re right that the arbitrary size check is odd in general. However it’s safe if we make the assumption that user code doesn’t rely on side effects or in-place operation. In general we’ve chosen to make that assumption, because we have to anyway for things to work.
