
Functional helpers on XComArg and task expansion against 'zipped' inputs


A need has been established to expand a task against aggregated inputs from different upstream tasks in a “zipped” fashion, rather than the Cartesian product implemented by expand(). This is currently doable for taskflow tasks, by injecting an intermediate task to perform the zip operation and unpacking the zipped values manually:

@task
def up_1():
    return ["a", "b"]

@task
def up_2():
    return ["x", "y"]

@task
def aggregate(v1, v2):  # The intermediate job to zip.
    return list(zip(v1, v2))

@task
def output(v):
    print(v)

aggregated = aggregate(v1=up_1(), v2=up_2())

output.expand(v=aggregated)
# Creates two mapped tasks printing
# ["a", "x"]
# ["b", "y"]

However, the same cannot be done with a classic operator class, since Python offers no way to unpack each zipped item into the operator’s keyword arguments.

Furthermore, the intermediate aggregation task (the one that performs zip) can introduce significant overhead: it needs to create a new process and load the entire lists from upstream tasks, which is exactly the kind of work task mapping aims to avoid where possible.
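The keyword-argument limitation can be sketched in plain Python. The PrintOperator class below is hypothetical, standing in for any classic operator whose constructor takes keyword arguments:

```python
# Minimal sketch (hypothetical PrintOperator, not a real Airflow operator).
class PrintOperator:
    def __init__(self, *, namespace, image):  # keyword-only constructor args
        self.namespace = namespace
        self.image = image

item = ("a", "x")          # one element of the zipped list
# PrintOperator(*item)     # TypeError: the constructor takes keyword-only args
op = PrintOperator(**{"namespace": item[0], "image": item[1]})  # works, but
# requires building a dict first -- which is exactly the conversion step
# the proposal below automates.
```

A taskflow task sidesteps this because its Python body can unpack the tuple itself; a classic operator has no such hook.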


Implement expand_kwargs on classic operators

To make zip unpacking possible for classic operators, we will add a new method expand_kwargs that works similarly to expand, but takes a single list of dicts instead. The task expands against the list and unpacks each dict into the operator’s keyword arguments. For example:

@task
def to_dict(v):
    return {"namespace": v[0], "image": v[1]}

kwargs = to_dict(v=aggregated)
KubernetesPodOperator.partial(task_id="kube").expand_kwargs(kwargs)
# Produces two mapped tasks:
# KubernetesPodOperator(namespace="a", image="x")
# KubernetesPodOperator(namespace="b", image="y")

Also note that since zip unpacking is already possible in taskflow tasks, @task will not receive the same expand_kwargs addition for now, although it is a conceivable addition if we decide API consistency is important.

The name of the method is undecided; expand_unpack is another obvious choice. We can change this easily enough before 2.4 is out the door, so let’s not let naming block implementation. Ideas for alternative names are very welcome.

(Note: There used to be a section about alternative API here, but I deleted it since the design isn’t very appealing, and the section’s existence confuses what we actually want to do.)


Implement map() on XComArg

To reduce the overhead of converting upstream values into a zipped dict, we will implement additional methods on XComArg. This will be done by adding a “modifier” callable on XComArg that is called when the XComArg is resolved, just before the downstream consumer executes. This means we can rewrite the above as:

def to_dict(v):
    return {"namespace": v[0], "image": v[1]}

kwargs = aggregated.map(to_dict)
KubernetesPodOperator.partial(task_id="kube").expand_kwargs(kwargs)
# Same result as above!

Note that to_dict in this version is not a task. This essentially “merges” the previous to_dict task into the KubernetesPodOperator task and eliminates the extra process needed for the trivial operation.
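The “modifier” idea can be sketched as a lazy reference that stores callables and applies them only at resolution time, so no extra task or process is needed. The names below (LazyXComRef, resolve) are illustrative, not Airflow’s actual internals:

```python
# Hypothetical sketch of a lazy XCom reference with map() modifiers.
class LazyXComRef:
    def __init__(self, resolve_fn, modifiers=()):
        self._resolve_fn = resolve_fn  # pulls the upstream value(s)
        self._modifiers = list(modifiers)

    def map(self, fn):
        # Return a new reference; nothing runs until resolve().
        return LazyXComRef(self._resolve_fn, self._modifiers + [fn])

    def resolve(self):
        values = self._resolve_fn()
        for fn in self._modifiers:
            values = [fn(v) for v in values]
        return values

ref = LazyXComRef(lambda: [("a", "x"), ("b", "y")])
kwargs_ref = ref.map(lambda v: {"namespace": v[0], "image": v[1]})
print(kwargs_ref.resolve())
# [{'namespace': 'a', 'image': 'x'}, {'namespace': 'b', 'image': 'y'}]
```

Because the callable runs inside the consumer’s resolution step, the transformation happens in the same worker process as the downstream task.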

It is also conceivable to implement other functional and iterating helpers, such as chain() and flat_map(), on XComArg in the future.

Implement zip on XComArg

An optional optimisation is to also make zipping happen in the scheduler, by implementing the operation as first-class syntax:

aggregated = up_1().zip(up_2())

Besides eliminating the intermediate aggregate task, this may also save memory in the worker: the zip operation in the scheduler does not need to load the upstream return values in their entirety, but can pull one individual value at a time (from each list) into the corresponding downstream mapped task. This, however, requires extending the XComArg implementation: currently an XComArg depends on exactly one upstream task; to support zipping, it must be able to depend on multiple upstream tasks (in this example, aggregated depends on both up_1 and up_2).
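Per-index resolution against multiple upstreams can be sketched as follows. The ZippedRef class and the injected pull function are illustrative assumptions, not Airflow internals; the point is that mapped task i only fetches item i from each upstream, never the full lists:

```python
# Sketch of per-index zip resolution over multiple upstream tasks.
class ZippedRef:
    def __init__(self, *upstream_task_ids):
        # Unlike a plain XComArg, this reference depends on several upstreams.
        self.upstream_task_ids = upstream_task_ids

    def resolve_item(self, index, pull):
        # pull(task_id, index) fetches one XCom element; it is injected here
        # so the sketch stays self-contained.
        return tuple(pull(tid, index) for tid in self.upstream_task_ids)

# Stand-in for the XCom backend.
store = {"up_1": ["a", "b"], "up_2": ["x", "y"]}
pull = lambda task_id, i: store[task_id][i]

aggregated = ZippedRef("up_1", "up_2")
print(aggregated.resolve_item(0, pull))  # ('a', 'x')
print(aggregated.resolve_item(1, pull))  # ('b', 'y')
```

Each mapped task index resolves independently, which is where the memory saving over an eager intermediate zip task comes from.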

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
uranusjr commented, Jun 21, 2022

Forgot to mention here: I also wrote some documentation additions for this, with some more useful examples: #24489. Feel free to propose any suggestions there as well.

0 reactions
uranusjr commented, Aug 8, 2022

This has been implemented.
