
Functional helpers on XComArg and task expansion against 'zipped' inputs


A need has been established to expand a task against aggregated inputs from different upstream tasks in a “zipped” fashion, rather than the Cartesian product implemented by expand(). This is currently doable for taskflow tasks, by injecting an intermediate task to perform the zip operation and unpacking the zipped values manually:

@task
def up_1():
    return ["a", "b"]

@task
def up_2():
    return ["x", "y"]

@task
def aggregate(v1, v2):  # The intermediate job to zip.
    return list(zip(v1, v2))

@task
def output(v):
    print(v)

aggregated = aggregate(v1=up_1(), v2=up_2())

output.expand(v=aggregated)
# Creates two mapped tasks printing
# ["a", "x"]
# ["b", "y"]

However, the same cannot be done with a classic operator class, since Python offers no way to unpack each zipped item into the operator’s keyword arguments.

Furthermore, the intermediate aggregation task (the one that performs zip) can introduce significant overhead: it needs to create a new process and load the entire lists from upstream tasks, which is exactly the kind of work task mapping aims to avoid where possible.
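The keyword-argument limitation can be sketched in plain Python. The PrintOperator class below is hypothetical, standing in for any classic operator whose constructor takes keyword arguments:

```python
# Minimal sketch (hypothetical PrintOperator, not a real Airflow operator).
class PrintOperator:
    def __init__(self, *, namespace, image):  # keyword-only constructor args
        self.namespace = namespace
        self.image = image

item = ("a", "x")          # one element of the zipped list
# PrintOperator(*item)     # TypeError: the constructor takes keyword-only args
op = PrintOperator(**{"namespace": item[0], "image": item[1]})  # works, but
# requires building a dict first -- which is exactly the conversion step
# the proposal below automates.
```

A taskflow task sidesteps this because its Python body can unpack the tuple itself; a classic operator has no such hook.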


Implement expand_kwargs on classic operators

To make zip unpacking possible for classic operators, we will add a new method expand_kwargs that works similarly to expand, but takes a single list of dicts instead. The task expands against the list and unpacks each dict into the operator’s keyword arguments. For example:

@task
def to_dict(v):
    return {"namespace": v[0], "image": v[1]}

kwargs = to_dict(v=aggregated)
KubernetesPodOperator.partial(task_id="kube").expand_kwargs(kwargs)
# Produces two mapped tasks:
# KubernetesPodOperator(namespace="a", image="x")
# KubernetesPodOperator(namespace="b", image="y")

Also note that since zip unpacking is already possible in taskflow tasks, @task will not receive the same expand_kwargs addition for now, although it is a conceivable addition if we decide API consistency is important.

The name of the method is undecided; expand_unpack is another obvious choice. We can change this easily enough before 2.4 is out the door, so let’s not let naming block implementation. Ideas for alternative names are very welcome.

(Note: There used to be a section about alternative API here, but I deleted it since the design isn’t very appealing, and the section’s existence confuses what we actually want to do.)


Implement map() on XComArg

To reduce the overhead of converting upstream values into a zipped dict, we will implement additional methods on XComArg. This will be done by adding a “modifier” callable on XComArg that is called when the XComArg is resolved, just before the downstream consumer executes. This means we can rewrite the above as:

def to_dict(v):
    return {"namespace": v[0], "image": v[1]}

kwargs = aggregated.map(to_dict)
KubernetesPodOperator.partial(task_id="kube").expand_kwargs(kwargs)
# Same result as above!

Note that to_dict in this version is not a task. This essentially “merges” the previous to_dict task into the KubernetesPodOperator task and eliminates the extra process needed for the trivial operation.
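The “modifier” idea can be sketched as a lazy reference that stores callables and applies them only at resolution time, so no extra task or process is needed. The names below (LazyXComRef, resolve) are illustrative, not Airflow’s actual internals:

```python
# Hypothetical sketch of a lazy XCom reference with map() modifiers.
class LazyXComRef:
    def __init__(self, resolve_fn, modifiers=()):
        self._resolve_fn = resolve_fn  # pulls the upstream value(s)
        self._modifiers = list(modifiers)

    def map(self, fn):
        # Return a new reference; nothing runs until resolve().
        return LazyXComRef(self._resolve_fn, self._modifiers + [fn])

    def resolve(self):
        values = self._resolve_fn()
        for fn in self._modifiers:
            values = [fn(v) for v in values]
        return values

ref = LazyXComRef(lambda: [("a", "x"), ("b", "y")])
kwargs_ref = ref.map(lambda v: {"namespace": v[0], "image": v[1]})
print(kwargs_ref.resolve())
# [{'namespace': 'a', 'image': 'x'}, {'namespace': 'b', 'image': 'y'}]
```

Because the callable runs inside the consumer’s resolution step, the transformation happens in the same worker process as the downstream task.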

It is also conceivable to implement other functional and iterating helpers, such as chain() and flat_map(), on XComArg in the future.

Implement zip on XComArg

An optional optimisation is to also make zipping happen in the scheduler, by implementing the operation as first-class syntax:

aggregated = up_1().zip(up_2())

Besides eliminating the intermediate aggregate task, this may also save memory in the worker: the zip operation in the scheduler does not need to load the upstream return values in their entirety, but can pull one individual value at a time (from each list) into the corresponding downstream mapped task. This, however, requires extending the XComArg implementation: currently an XComArg depends on exactly one upstream task; to support zipping, it must be able to depend on multiple upstream tasks (in this example, aggregated depends on both up_1 and up_2).
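Per-index resolution against multiple upstreams can be sketched as follows. The ZippedRef class and the injected pull function are illustrative assumptions, not Airflow internals; the point is that mapped task i only fetches item i from each upstream, never the full lists:

```python
# Sketch of per-index zip resolution over multiple upstream tasks.
class ZippedRef:
    def __init__(self, *upstream_task_ids):
        # Unlike a plain XComArg, this reference depends on several upstreams.
        self.upstream_task_ids = upstream_task_ids

    def resolve_item(self, index, pull):
        # pull(task_id, index) fetches one XCom element; it is injected here
        # so the sketch stays self-contained.
        return tuple(pull(tid, index) for tid in self.upstream_task_ids)

# Stand-in for the XCom backend.
store = {"up_1": ["a", "b"], "up_2": ["x", "y"]}
pull = lambda task_id, i: store[task_id][i]

aggregated = ZippedRef("up_1", "up_2")
print(aggregated.resolve_item(0, pull))  # ('a', 'x')
print(aggregated.resolve_item(1, pull))  # ('b', 'y')
```

Each mapped task index resolves independently, which is where the memory saving over an eager intermediate zip task comes from.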

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
uranusjr commented, Jun 21, 2022

Forgot to mention here: I also wrote some documentation additions for this, with some more useful examples: #24489. Feel free to propose any suggestions there as well.

0 reactions
uranusjr commented, Aug 8, 2022

This has been implemented.
