Access specific values within an `XCom` value using Taskflow API

Description

Currently, the output property of operators doesn’t support accessing a specific value within an XCom, only the entire XCom value. Ideally, calling the XComArg via the output property would behave the same as the task_instance.xcom_pull() method, in which a user has immediate access to the XCom value and can directly access specific values within it.

For example, in the example DAG in the Apache Beam provider, the job_id argument of the DataflowJobStatusSensor task is a templated value that uses the task_instance.xcom_pull() method and then accesses the dataflow_job_id key within the XCom value:

start_python_job_dataflow_runner_async = BeamRunPythonPipelineOperator(
    task_id="start_python_job_dataflow_runner_async",
    runner="DataflowRunner",
    py_file=GCS_PYTHON_DATAFLOW_ASYNC,
    pipeline_options={
        'tempLocation': GCS_TMP,
        'stagingLocation': GCS_STAGING,
        'output': GCS_OUTPUT,
    },
    py_options=[],
    py_requirements=['apache-beam[gcp]==2.26.0'],
    py_interpreter='python3',
    py_system_site_packages=False,
    dataflow_config=DataflowConfiguration(
        job_name='{{task.task_id}}',
        project_id=GCP_PROJECT_ID,
        location="us-central1",
        wait_until_finished=False,
    ),
)

wait_for_python_job_dataflow_runner_async_done = DataflowJobStatusSensor(
    task_id="wait-for-python-job-async-done",
    job_id="{{task_instance.xcom_pull('start_python_job_dataflow_runner_async')['dataflow_job_id']}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    project_id=GCP_PROJECT_ID,
    location='us-central1',
)

There is currently no equivalent way to directly access the dataflow_job_id value in the same manner using the output property.

Using start_python_job_dataflow_runner_async.output["dataflow_job_id"] yields the equivalent of task_instance.xcom_pull(task_ids='start_python_job_dataflow_runner_async', key='dataflow_job_id').

Even start_python_job_dataflow_runner_async.output["return_value"]["dataflow_job_id"] yields the same result: task_instance.xcom_pull(task_ids='start_python_job_dataflow_runner_async', key='dataflow_job_id'). In both cases the subscript is treated as the XCom key rather than as a key inside the returned value.
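The distinction at play is between an XCom key and a key inside an XCom's value. The toy model below is not Airflow code: XCOM_STORE and this stand-in xcom_pull function are hypothetical names used only for illustration, and it assumes the operator pushes a single dictionary under the default return_value key.

```python
# Toy model of an XCom table: (task_id, xcom_key) -> stored value.
# Illustration only; this is not Airflow's implementation.
XCOM_STORE = {
    ("start_python_job_dataflow_runner_async", "return_value"): {
        "dataflow_job_id": "example-job-id",  # made-up value
    },
}


def xcom_pull(task_ids, key="return_value"):
    """Return the value stored under (task_ids, key), or None if absent."""
    return XCOM_STORE.get((task_ids, key))


task_id = "start_python_job_dataflow_runner_async"

# What the templated string does: pull the whole value, then index into it.
from_template = xcom_pull(task_id)["dataflow_job_id"]

# What .output["dataflow_job_id"] is described as resolving to: a lookup of
# a separate XCom whose key is "dataflow_job_id". Nothing was pushed there.
from_output_key = xcom_pull(task_id, key="dataflow_job_id")

print(from_template)
print(from_output_key)
```

In this model the first lookup finds the job ID, while the second finds nothing, which mirrors the mismatch described in this issue.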

It seems the only way to get the desired behavior currently is to hack around the __str__ method that’s available with XComArg:

# Strip the surrounding "{{ ... }}" so only the bare xcom_pull(...) expression remains.
start_python_job_dataflow_runner_async_output = str(start_python_job_dataflow_runner_async.output).strip("{ }")

wait_for_python_job_dataflow_runner_async_done = DataflowJobStatusSensor(
    task_id="wait-for-python-job-async-done",
    # f-string: "{{{{" and "}}}}" render as the literal Jinja delimiters "{{" and "}}".
    job_id=f"{{{{ {start_python_job_dataflow_runner_async_output}['dataflow_job_id'] }}}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    project_id=GCP_PROJECT_ID,
    location='us-central1',
)

This approach is not elegant, straightforward, or user-friendly.
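To see what the workaround does, the string handling can be reproduced without Airflow. This is a minimal sketch that assumes str() of the XComArg renders as a Jinja expression of the form held in rendered; the exact text may differ between Airflow versions, so treat it as a stand-in rather than captured output.

```python
# Assumed string form of str(operator.output); a stand-in, not captured
# from a real Airflow run.
rendered = (
    "{{ task_instance.xcom_pull("
    "task_ids='start_python_job_dataflow_runner_async', "
    "key='return_value') }}"
)

# strip("{ }") removes leading/trailing '{', '}' and space characters,
# leaving only the bare xcom_pull(...) expression.
inner = rendered.strip("{ }")

# In an f-string, '{{{{' emits a literal '{{' and '}}}}' a literal '}}',
# so the result is a new Jinja expression that indexes the pulled value.
job_id_template = f"{{{{ {inner}['dataflow_job_id'] }}}}"

print(job_id_template)
```

The rebuilt string has the same shape as the hand-written xcom_pull template, which is why the workaround functions at all; it also depends on the string form of XComArg staying stable.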

Use case / motivation

It’s intuitive for users to have direct access to specific values in the XCom behind an XComArg via the Taskflow API, as with the classic xcom_pull() method. Ideally, an operator’s .output property and the xcom_pull() method would behave the same way when passing actual values between operators.

Are you willing to submit a PR?

I would love to, but I would certainly need some guidance on the nuances here.

Related Issues

https://github.com/apache/airflow/issues/10285

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 2
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
ashb commented, Aug 3, 2021

It would also be great if there was a way of NOT creating the dependencies between tasks automatically. Passing cluster or job IDs (EMR, Databricks, etc.) is very common, and having these dependencies created automatically doesn’t make much sense.

Without a dependency, you might try to get a value out of XCom that hasn’t been written yet!

0 reactions
ricardogaspar2 commented, Aug 16, 2021

Not part of this topic, but it would be cool to have a visual representation of the variable that is being passed, much like Dagster does in their UI.

@ricardogaspar2 Sounds useful – do you have any screenshot you could show?

Sure thing. The screenshot below was grabbed from this talk: https://www.youtube.com/watch?v=D_1VJapCscc&t=1055s

[Image: Screenshot 2021-08-16 at 11 17 09]

There is also some info here: https://dagster.io/blog/dagster-airflow
