A proposal for PIN 4
This is an evolution of a conversation @cicdw and I started last night. This is not an actual PIN (yet), but it is a proposal that I’d invite comment on, and if it gains some momentum I’ll submit it as a proper PIN. It is written in PIN format.
PIN-4 Result Handlers
2019-01-30 Jeremiah Lowin
Status
Proposed
Context
Task results are a key part of any data pipeline, and in Prefect they have special meaning. Because Prefect tasks do not merely return data, but rather `State` objects, we have a wide variety of possible actions beyond simply passing the output of one task to the next task.
For example, we might allow a task to cache its output and return it in the future without repeating an expensive computation. We might detect that a state is going to be retried, and therefore cache all of its inputs so they’ll be available in the future. We might greedily write every task’s result to disc in order to pessimistically defend against node failures. We might even… simply pass the result to the next task.
Because of all these possible actions and the need to apply them to completely arbitrary Python objects, Prefect requires a great deal of result handling logic. This largely comes down to three buckets:
- Moving results through the system
- Knowing how results should be serialized for remote storage
- Actually serializing results when they are sent to remote storage
Some of these are far more deceptive than they seem. For example, if task A passes a result to task B, and task B needs to retry, then task B is responsible for serializing the result of task A – after all, A has no way of knowing that its result will be reused in the future. This may be surprising to someone unfamiliar with Prefect (and fortunately is an internal detail users don’t need to know!). This implies that it isn’t enough for A to know how to serialize its own results; B must know how to serialize A’s result as well.
As I write this, Prefect is doing a great job of handling all of the above cases. PIN-2 introduced many of the required mechanisms for doing this in a distributed way, interacting with remote servers. However, the logic for working with results has spread across a few classes: it manifests as special methods in the `TaskRunner` class (where results must be parsed out of `upstream_states`), a series of `ResultHandler` objects, methods for setting/getting results on `State` objects, and a complex `State._metadata` object for tracking how to serialize results as well as whether they have, in fact, been serialized. For example, consider PRs #581 and #595.
It is desirable to consolidate all of this result handling logic into a single location, as much as possible. This has a few benefits:
- a single object to test and work with, with a known API
- type safety and introspection via `mypy`
- more formal contracts for serializing results via `marshmallow` schemas
- a reduced development burden, since contributors can write methods that work with results without also needing to know exactly how to serialize those results
Decision
Result class
We will implement a `Result` class that contains information about:
- the value of a task’s result
- whether that value has been serialized
- how to serialize or deserialize the result
Note: throughout this document I use the terms “serialized” and “serializer” where currently we use “raw” and “result_handler”, because the current terms seem inconsistent if they are both attributes of the same object. Another alternative – “handled” and “handler” – seems insufficiently descriptive.
The signature of this object is:
```python
class Result:
    value: Any
    serialized: bool
    serializer: "ResultSerializer"

    def serialize(self) -> "Result":
        """
        Return a new Result that represents a serialized version of this Result.
        """
        value = self.value
        if not self.serialized:
            value = self.serializer.serialize(self.value)
        return Result(value=value, serialized=True, serializer=self.serializer)

    def deserialize(self) -> "Result":
        """
        Return a new Result that represents a deserialized version of this Result.
        """
        value = self.value
        if self.serialized:
            value = self.serializer.deserialize(self.value)
        return Result(value=value, serialized=False, serializer=self.serializer)
```
```python
class ResultSerializer:
    def serialize(self, value: Any) -> JSONLike:
        """Serialize a result to a JSON-compatible object."""

    def deserialize(self, blob: JSONLike) -> Any:
        """Deserialize a result from a JSON-compatible object."""
```
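To make the round-trip behavior concrete, here is a minimal runnable sketch of these two classes. The `JSONResultSerializer` name and the dataclass implementation are illustrative assumptions, not part of the proposal:

```python
import json
from dataclasses import dataclass
from typing import Any


class JSONResultSerializer:
    """Hypothetical serializer that stores results as JSON strings."""

    def serialize(self, value: Any) -> str:
        # serialize a result to a JSON-compatible blob
        return json.dumps(value)

    def deserialize(self, blob: str) -> Any:
        # recover the original Python value from the blob
        return json.loads(blob)


@dataclass(frozen=True)
class Result:
    value: Any
    serialized: bool
    serializer: Any

    def serialize(self) -> "Result":
        value = self.value
        if not self.serialized:
            value = self.serializer.serialize(self.value)
        return Result(value=value, serialized=True, serializer=self.serializer)

    def deserialize(self) -> "Result":
        value = self.value
        if self.serialized:
            value = self.serializer.deserialize(self.value)
        return Result(value=value, serialized=False, serializer=self.serializer)


raw = Result(value={"x": 1}, serialized=False, serializer=JSONResultSerializer())
blob = raw.serialize()           # serialized=True, value is now a JSON string
round_trip = blob.deserialize()  # back to the original Python value
```

Note that because serialization status travels with the value, repeated calls to `serialize()` are effectively idempotent: a `Result` that is already serialized is returned unchanged.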
State
We will change states to have the following signature (only showing changed attributes):
```python
class State:
    _result: Result
    cached_inputs: Dict[str, Result]
    cached_result: Result

    @property
    def result(self) -> Any:
        """
        This property is to maintain the current user-friendly API in which
        state.result is a Python value, not a Prefect internal object
        """
        return self._result.value
```
The `_metadata` object is removed.
We will also modify the various `StateSchema` objects to use a new `ResultSchema` object as a nested field. This `ResultSchema` will automatically call `Result.serialize()` whenever it is asked to `dump()` a `Result` with `serialized=False`, and automatically call `Result.deserialize()` whenever `load()`-ing a `Result` with `serialized=True`.
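For concreteness, here is a plain-Python sketch of that dump/load contract. The names `ResultSchemaSketch` and `JSONSerializer` are illustrative assumptions; the real implementation would be a marshmallow schema, but the rules are the same:

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


class JSONSerializer:
    def serialize(self, value: Any) -> str:
        return json.dumps(value)

    def deserialize(self, blob: str) -> Any:
        return json.loads(blob)


@dataclass(frozen=True)
class Result:
    # minimal stand-in for the proposed Result class
    value: Any
    serialized: bool
    serializer: Any

    def serialize(self) -> "Result":
        if self.serialized:
            return self
        return Result(self.serializer.serialize(self.value), True, self.serializer)

    def deserialize(self) -> "Result":
        if not self.serialized:
            return self
        return Result(self.serializer.deserialize(self.value), False, self.serializer)


class ResultSchemaSketch:
    """Plain-Python stand-in for the proposed ResultSchema behavior."""

    def dump(self, result: Result) -> Dict[str, Any]:
        # dump() serializes on demand, so callers never do it by hand
        result = result.serialize()
        return {"value": result.value, "serialized": result.serialized}

    def load(self, data: Dict[str, Any], serializer: Any) -> Result:
        # load() always hands back a deserialized ("raw") Result
        result = Result(data["value"], data["serialized"], serializer)
        return result.deserialize()


schema = ResultSchemaSketch()
payload = schema.dump(Result([1, 2], serialized=False, serializer=JSONSerializer()))
restored = schema.load(payload, JSONSerializer())
```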
This means that user code should never have to actually serialize or deserialize results. It can work with them assuming they are deserialized (what we currently call “raw”) and know that the Prefect serialization mechanisms will automatically apply the correct ResultSerializer
whenever a state is serialized for transport, for example to Cloud.
TaskRunner
These changes would dramatically simplify the various methods in the `TaskRunner` class. For example, right now, after assigning `state.cached_inputs = <inputs>`, the user must then call `state.update_input_metadata(<upstream_states>)` in order to hydrate the metadata for those inputs, based on information loaded out of the task’s upstream states.
This new setup would allow simply setting cached inputs and moving on:
```python
inputs = {
    'x': upstream_task_1._result,  # type: Result
    'y': upstream_task_2._result,  # type: Result
}
state.cached_inputs = inputs
```
Now, the `cached_inputs` attribute of that state has all the information required to serialize the results that were received from the upstream states.
Given the above `inputs` dictionary, calling the task’s `run()` method could just be:

```python
value = task.run(**{k: v.value for k, v in inputs.items()})
result = Result(value=value, serialized=False, serializer=task.serializer)
```
And caching that result:

```python
state.cached_result = result
```
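Putting those steps together, a runnable end-to-end sketch might look like the following. The `Task`, `State`, and `JSONSerializer` classes here are hypothetical stand-ins, not the proposed implementations:

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


class JSONSerializer:
    def serialize(self, value: Any) -> str:
        return json.dumps(value)

    def deserialize(self, blob: str) -> Any:
        return json.loads(blob)


@dataclass(frozen=True)
class Result:
    value: Any
    serialized: bool
    serializer: Any


@dataclass
class State:
    cached_inputs: Dict[str, Result] = field(default_factory=dict)
    cached_result: Any = None


class Task:
    """Hypothetical task that doubles its single input."""

    serializer = JSONSerializer()

    def run(self, x: int) -> int:
        return x * 2


task = Task()
# pretend this Result came off an upstream state
inputs = {"x": Result(value=21, serialized=False, serializer=JSONSerializer())}

state = State()
state.cached_inputs = inputs  # no metadata bookkeeping needed
value = task.run(**{k: v.value for k, v in inputs.items()})
state.cached_result = Result(value=value, serialized=False,
                             serializer=task.serializer)
```

Because each cached `Result` carries its own serializer, the state now holds everything needed to serialize both the inputs and the output later, without consulting the upstream tasks again.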
FlowRunner
Despite all of this, the `FlowRunner` would require no changes. It would return an object that could still be accessed in exactly the same way we do today:

```python
state = flow.run(return_tasks=flow.tasks)
# the value returned by task "x"
state.result[x].result
```
CloudTaskRunner
If the marshmallow schemas for states were properly configured to serialize `Result`s (as described in the State section, above), then the `CloudTaskRunner` would have to do no new work to properly sanitize state results for transport. Simply serializing the `State` object would take care of all necessary work.

It might be undesirable to aggressively deserialize results every time a state is deserialized, however, so we could consider a context kwarg to opt into that behavior.
Consequences
Adopting these new classes would dramatically simplify a great deal of logic that is very important to how Prefect operates. Rather than distributing that logic across the `State`, `TaskRunner`, and `CloudTaskRunner` classes, it would live in a single place: the `Result` object. Moreover, while the `Result` object contains all information for properly serializing/deserializing its value, the actual calling of the serialization methods could be delegated almost exclusively to Prefect’s existing schema serializers. Users (probably) would never have to do this by hand.
Ah, very interesting. So reverting to the question from #591, the logic could be: populate `cached_inputs` from `upstream_states`, AS LONG AS it’s not a `NoResult`. Very nice.
@joshmeek:

Yes, I was just trying to sketch out a hypothetical `State` that had all of these attributes. I’ll clarify what I meant in the body of the PIN.

I’m only contemplating one, but I guess any class that appropriately implemented this contract would work. If `Result`s are created by other functions, it’s simpler if there’s only one possible class to create. However, @cicdw raised an interesting point in #591 that a `NoResult` could be represented by a class. (There’s a larger question there of whether we can have “no result”, or whether Python’s implicit `None` return is always a result…)

Correct, so it’s just a dict whose keys must match the `kwargs` of the task.

Sure, it’s a little out of scope for this to say exactly how that would work, but basically: if deserializing a state object also automatically retrieved its result, then you could end up loading a ton of data inadvertently (for example, flow runs start by loading all states; we don’t necessarily want to download all data then, too). So I think a final implementation of this would have a way to defer deserialization (or maybe just not deserialize the result automatically at all).