Names for expanded tasks

Description

Airflow currently exposes map_index to the user as a way of distinguishing between tasks in an expansion. The index is unlikely to be meaningful to the user. They probably have their own label for this action. I’m requesting that we allow them to add that label.

To see the problem, consider a dag that sends email to a list of users which is generated at runtime:

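from textwrap import dedent

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(...) as dag: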

    @dag.task
    def get_account_status():
        return [
            {
                "NAME": "Wintermute",
                "EMAIL": "wintermute@tessier-ashpool.com",
                "STATUS": "active",
            },
            {
                "NAME": "Hojo",
                "EMAIL": "ops@research.shinra.com",
                "STATUS": "delinquent",
            },
        ]

    BashOperator.partial(
        task_id="send_email",
        bash_command=dedent(
            """
            cat <<- EOF | tee | mailx -s "your account" $EMAIL
            Dear $NAME,
                Your account status is $STATUS.
            EOF
            """
        ),
    ).expand(env=get_account_status())

Notice that in the grid view, it’s not obvious which task goes with which user, because each expanded task is identified only by its map_index.

Use case/motivation

I’d like to be able to explicitly assign a name to each expanded task so that I can later find the right one. I would like this name to be used (when available) anywhere that the user interacts with the expanded task.

In cases where the user provides no names, perhaps we can generate some. For instance, this expansion generates four instances (two commands times two environments):

BashOperator.partial(task_id="greet").expand(
    bash_command=["echo hello $USER", "echo goodbye $USER"],
    env=[{"USER": "foo"}, {"USER": "bar"}],
)

The friendliest way would be to use the requested feature to name each task:

hi_foo
hi_bar
bye_foo
bye_bar

As it is, the user will see only the map_index of each instance.

But if the user doesn’t give names, maybe we should generate some names for them:

bash_command_1_env_1
bash_command_1_env_2
bash_command_2_env_1
bash_command_2_env_2

I don’t know. I’m creating this issue so we have a place to discuss it.
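As a starting point for that discussion, here is a minimal sketch of how the generated names listed above could be derived from the mapped arguments. It is plain Python and does not touch Airflow; `default_map_names` is a hypothetical helper, and the ordering (last argument varying fastest) is an assumption that may not match the order in which Airflow assigns map_index.

```python
from itertools import product


def default_map_names(**mapped_kwargs):
    """Hypothetical helper: build one fallback label per expanded instance.

    Each keyword is a mapped argument and its value is the list being
    expanded. Labels combine the argument name with a 1-based position.
    """
    keys = list(mapped_kwargs)
    # 1-based positions for every mapped argument, e.g. range(1, 3) -> 1, 2
    positions = [range(1, len(mapped_kwargs[key]) + 1) for key in keys]
    # Cross product of positions: one combination per expanded instance
    return [
        "_".join(f"{key}_{pos}" for key, pos in zip(keys, combo))
        for combo in product(*positions)
    ]


names = default_map_names(
    bash_command=["echo hello $USER", "echo goodbye $USER"],
    env=[{"USER": "foo"}, {"USER": "bar"}],
)
for name in names:
    print(name)
```

Names built this way stay unique without inspecting the argument values, but they say nothing about what each instance does, which is why user-supplied labels would still be preferable.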

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project’s Code of Conduct

Issue Analytics

Created a year ago
Reactions: 1
Comments: 18 (18 by maintainers)

Top GitHub Comments

2 reactions

uranusjr commented, Aug 4, 2022

It just occurred to me that this is essentially a part of #22073. What we (users) actually want is a more customisable way to identify things (in this instance, a mapped task instance), and if we look past the assumption that a mapped task instance is “task_id + map_index”, we simply need a better way for the user to tell “what is this thing” in the Airflow UI. So let’s keep track of that issue instead to make sure whatever solution we come up with for it correctly considers map_index.

2 reactions

potiuk commented, Aug 3, 2022

I think you can forget about this.

You’ve just hit the reality train (or rather, the reality train hit you 😃)

Look there: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-42+Dynamic+Task+Mapping - see this part:

“Rather than overloading the task_id argument to airflow tasks run (i.e. having a task_id of run_after_loop[0]) we will add a new --mapping-id argument to airflow tasks run – this value will be a ~JSON-encoded~ an integer specifying the index/position of the mapping.” (see also comments in the doc).

We have to support MySQL, and the problem with MySQL is that the index key size is limited. VERY limited. Depending on the type of encoding, it might be as little as 760 characters or so. And task_id + dag_id + (string) task_index already exceeds the limit by far. There is no way around it, and this was (I believe) the main reason we had to use an integer, even though originally we planned not even a name but a JSON-encoded list of parameters, very similar to what you proposed (which was far better for uniqueness, because it was automated).

But this is just what I saw while observing it being implemented, so I might be wrong about whether that was the only or main reason for changing the original decision.