A proposal for PIN 4
This is an evolution of a conversation @cicdw and I started last night. This is not an actual PIN (yet), but it is a proposal that I’d invite comment on, and if it gains some momentum I’ll submit it as a proper PIN. It is written in PIN format.
PIN-4 Result Handlers
2019-01-30 Jeremiah Lowin
Status
Proposed
Context
Task results are a key part of any data pipeline, and in Prefect they have special meaning. Because Prefect tasks do not merely return data, but rather `State` objects, we have a wide variety of possible actions beyond simply passing the output of one task to the next task.
For example, we might allow a task to cache its output and return it in the future without repeating an expensive computation. We might detect that a state is going to be retried, and therefore cache all of its inputs so they’ll be available in the future. We might greedily write every task’s result to disc in order to pessimistically defend against node failures. We might even… simply pass the result to the next task.
Because of all these possible actions and the need to apply them to completely arbitrary Python objects, Prefect requires a great deal of result handling logic. This largely comes down to three buckets:
- Moving results through the system
- Knowing how results should be serialized for remote storage
- Actually serializing results when they are sent to remote storage
Some of these are far more deceptive than they seem. For example, if task A passes a result to task B, and task B needs to retry, then task B is responsible for serializing the result of task A – after all, A has no way of knowing that its result will be reused in the future. This may be surprising to someone unfamiliar with Prefect (and fortunately is an internal detail users don’t need to know!). This implies that it isn’t enough for A to know how to serialize its own results; B must know how to serialize A’s result as well.
As I write this, Prefect is doing a great job of handling all of the above cases. PIN-2 introduced many of the required mechanisms for doing this in a distributed way, interacting with remote servers. However, the logic for working with results has spread across a few classes: it manifests as special methods in the `TaskRunner` class (where results must be parsed out of `upstream_states`), a series of `ResultHandler` objects, methods for setting/getting results on `State` objects, and a complex `State._metadata` object for tracking how to serialize results as well as whether they have, in fact, been serialized. For example, consider PRs #581 and #595.
It is desirable to consolidate all of this result handling logic into a single location, as much as possible. This has a few benefits:
- a single object to test and work with, with a known API
- type safety and introspection via `mypy`
- more formal contracts for serializing results via `marshmallow` schemas
- a reduced development burden, since contributors can write methods that work with results without also needing to know exactly how to serialize those results
Decision
Result class
We will implement a `Result` class that contains information about:
- the value of a task’s result
- whether that value has been serialized
- how to serialize or deserialize the result
Note: throughout this document I use the terms “serialized” and “serializer” where currently we use “raw” and “result_handler”, because the current terms seem inconsistent if they are both attributes of the same object. Another alternative – “handled” and “handler” – seems insufficiently descriptive.
The signature of this object is:
```python
class Result:
    value: Any
    serialized: bool
    serializer: "ResultSerializer"

    def serialize(self) -> "Result":
        """
        Return a new Result that represents a serialized version of this Result.
        """
        value = self.value
        if not self.serialized:
            value = self.serializer.serialize(self.value)
        return Result(value=value, serialized=True, serializer=self.serializer)

    def deserialize(self) -> "Result":
        """
        Return a new Result that represents a deserialized version of this Result.
        """
        value = self.value
        if self.serialized:
            value = self.serializer.deserialize(self.value)
        return Result(value=value, serialized=False, serializer=self.serializer)
```
```python
class ResultSerializer:
    def serialize(self, value: Any) -> JSONLike:
        """Serialize a result to a JSON-compatible object."""

    def deserialize(self, blob: JSONLike) -> Any:
        """Deserialize a result from a JSON-compatible object."""
```
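To make the round-trip behavior concrete, here is a minimal runnable sketch of these two classes. The `JSONResultSerializer` name and the dataclass implementation are illustrative assumptions, not part of the proposal:

```python
import json
from dataclasses import dataclass
from typing import Any


class JSONResultSerializer:
    """Hypothetical serializer that stores results as JSON strings."""

    def serialize(self, value: Any) -> str:
        # serialize a result to a JSON-compatible blob
        return json.dumps(value)

    def deserialize(self, blob: str) -> Any:
        # recover the original Python value from the blob
        return json.loads(blob)


@dataclass(frozen=True)
class Result:
    value: Any
    serialized: bool
    serializer: Any

    def serialize(self) -> "Result":
        value = self.value
        if not self.serialized:
            value = self.serializer.serialize(self.value)
        return Result(value=value, serialized=True, serializer=self.serializer)

    def deserialize(self) -> "Result":
        value = self.value
        if self.serialized:
            value = self.serializer.deserialize(self.value)
        return Result(value=value, serialized=False, serializer=self.serializer)


raw = Result(value={"x": 1}, serialized=False, serializer=JSONResultSerializer())
blob = raw.serialize()           # serialized=True, value is now a JSON string
round_trip = blob.deserialize()  # back to the original Python value
```

Note that because serialization status travels with the value, repeated calls to `serialize()` are effectively idempotent: a `Result` that is already serialized is returned unchanged.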
State
We will change states to have the following signature (only showing changed attributes):
```python
class State:
    _result: Result
    cached_inputs: Dict[str, Result]
    cached_result: Result

    @property
    def result(self) -> Any:
        """
        This property is to maintain the current user-friendly API in which
        state.result is a Python value, not a Prefect internal object
        """
        return self._result.value
```
The `_metadata` object is removed.
We will also modify the various `StateSchema` objects to use a new `ResultSchema` object as a nested field. This `ResultSchema` will automatically call `Result.serialize()` whenever it is asked to `dump()` a `Result` with `serialized=False`, and automatically call `Result.deserialize()` whenever `load()`-ing a `Result` with `serialized=True`.
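For concreteness, here is a plain-Python sketch of that dump/load contract. The names `ResultSchemaSketch` and `JSONSerializer` are illustrative assumptions; the real implementation would be a marshmallow schema, but the rules are the same:

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


class JSONSerializer:
    def serialize(self, value: Any) -> str:
        return json.dumps(value)

    def deserialize(self, blob: str) -> Any:
        return json.loads(blob)


@dataclass(frozen=True)
class Result:
    # minimal stand-in for the proposed Result class
    value: Any
    serialized: bool
    serializer: Any

    def serialize(self) -> "Result":
        if self.serialized:
            return self
        return Result(self.serializer.serialize(self.value), True, self.serializer)

    def deserialize(self) -> "Result":
        if not self.serialized:
            return self
        return Result(self.serializer.deserialize(self.value), False, self.serializer)


class ResultSchemaSketch:
    """Plain-Python stand-in for the proposed ResultSchema behavior."""

    def dump(self, result: Result) -> Dict[str, Any]:
        # dump() serializes on demand, so callers never do it by hand
        result = result.serialize()
        return {"value": result.value, "serialized": result.serialized}

    def load(self, data: Dict[str, Any], serializer: Any) -> Result:
        # load() always hands back a deserialized ("raw") Result
        result = Result(data["value"], data["serialized"], serializer)
        return result.deserialize()


schema = ResultSchemaSketch()
payload = schema.dump(Result([1, 2], serialized=False, serializer=JSONSerializer()))
restored = schema.load(payload, JSONSerializer())
```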
This means that user code should never have to actually serialize or deserialize results. It can work with them assuming they are deserialized (what we currently call “raw”) and know that the Prefect serialization mechanisms will automatically apply the correct ResultSerializer
whenever a state is serialized for transport, for example to Cloud.
TaskRunner
These changes would dramatically simplify the various methods in the `TaskRunner` class. For example, right now, after assigning `state.cached_inputs = <inputs>`, the user must then call `state.update_input_metadata(<upstream_states>)` in order to hydrate the metadata for those inputs, based on information loaded out of the task’s upstream states.
This new setup would allow simply setting cached inputs and moving on:
```python
inputs = {
    'x': upstream_task_1._result,  # type: Result
    'y': upstream_task_2._result,  # type: Result
}
state.cached_inputs = inputs
```
Now, the `cached_inputs` attribute of that state has all the information required to serialize the results that were received from the upstream states.
Given the above `inputs` dictionary, calling the task’s `run()` method could just be:

```python
value = task.run(**{k: v.value for k, v in inputs.items()})
result = Result(value=value, serialized=False, serializer=task.serializer)
```
And caching that result:

```python
state.cached_result = result
```
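Putting those steps together, a runnable end-to-end sketch might look like the following. The `Task`, `State`, and `JSONSerializer` classes here are hypothetical stand-ins, not the proposed implementations:

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict


class JSONSerializer:
    def serialize(self, value: Any) -> str:
        return json.dumps(value)

    def deserialize(self, blob: str) -> Any:
        return json.loads(blob)


@dataclass(frozen=True)
class Result:
    value: Any
    serialized: bool
    serializer: Any


@dataclass
class State:
    cached_inputs: Dict[str, Result] = field(default_factory=dict)
    cached_result: Any = None


class Task:
    """Hypothetical task that doubles its single input."""

    serializer = JSONSerializer()

    def run(self, x: int) -> int:
        return x * 2


task = Task()
# pretend this Result came off an upstream state
inputs = {"x": Result(value=21, serialized=False, serializer=JSONSerializer())}

state = State()
state.cached_inputs = inputs  # no metadata bookkeeping needed
value = task.run(**{k: v.value for k, v in inputs.items()})
state.cached_result = Result(value=value, serialized=False,
                             serializer=task.serializer)
```

Because each cached `Result` carries its own serializer, the state now holds everything needed to serialize both the inputs and the output later, without consulting the upstream tasks again.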
FlowRunner
Despite all of this, the `FlowRunner` would require no changes. It would return an object that could still be accessed in exactly the same way we do today:

```python
state = flow.run(return_tasks=flow.tasks)
# the value returned by task "x"
state.result[x].result
```
CloudTaskRunner
If the marshmallow schemas for states were properly configured to serialize `Result`s (as described in the State section, above), then the `CloudTaskRunner` would have to do no new work to properly sanitize state results for transport. Simply serializing the `State` object would take care of all necessary work.

It might be undesirable to aggressively deserialize results every time a state is deserialized, however, so we could consider a context kwarg to opt into that behavior.
Consequences
Adopting these new classes would dramatically simplify a great deal of logic that is very important to how Prefect operates. Rather than distributing that logic across the `State`, `TaskRunner`, and `CloudTaskRunner` classes, it would live in a single place: the `Result` object. Moreover, while the `Result` object contains all information for properly serializing/deserializing its value, the actual calling of the serialization methods could be delegated almost exclusively to Prefect’s existing schema serializers. Users (probably) would never have to do this by hand.
Ah, very interesting. So reverting to the question from #591, the logic could be: populate `cached_inputs` from `upstream_states`, AS LONG AS it’s not a `NoResult`. Very nice.
@joshmeek:

Yes, I was just trying to sketch out a hypothetical `State` that had all of these attributes. I’ll clarify what I meant in the body of the PIN.

I’m only contemplating one, but I guess any class that appropriately implemented this contract would work. If `Result`s are created by other functions, it’s simpler if there’s only one possible class to create. However, @cicdw raised an interesting point in #591 that a `NoResult` could be represented by a class. (There’s a larger question there of whether we can have “no result”, or whether Python’s implicit `None` return is always a result…)

Correct, so it’s just a dict whose keys must match the `kwargs` of the task.

Sure, it’s a little out of scope for this to say exactly how that would work, but basically: if deserializing a state object also automatically retrieved its result, then you could end up loading a ton of data inadvertently (for example, flow runs start by loading all states; we don’t necessarily want to download all data then, too). So I think a final implementation of this would have a way to defer deserialization (or maybe just not deserialize the result automatically at all).