Access specific values within an `XCom` value using the TaskFlow API
Description
Currently, the `output` property of operators doesn't support accessing a specific value within an `XCom`; it references the entire `XCom` value. Ideally, calling the `XComArg` via the `output` property would function the same as the `task_instance.xcom_pull()` method, in which a user has immediate access to the `XCom` value and can directly access specific values within it.
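As a rough, self-contained sketch of the contrast (hypothetical task and DAG names; the exact behavior of indexing an `XComArg` is discussed below):

```python
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator

with DAG("xcom_indexing_sketch", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:

    @task
    def extract():
        # The entire dict is pushed to XCom under key="return_value".
        return {"dataflow_job_id": "job-123", "location": "us-central1"}

    extracted = extract()  # an XComArg referencing the whole return value

    # Works today: a hand-written Jinja template indexes into the pulled value.
    use_template = BashOperator(
        task_id="use_template",
        bash_command="echo {{ task_instance.xcom_pull(task_ids='extract')['dataflow_job_id'] }}",
    )

    # The request: let the XComArg itself be indexed the same way,
    # e.g. extracted["dataflow_job_id"], instead of only referencing the whole value.
```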
For example, in the example DAG in the Apache Beam provider, the `job_id` arg in the `DataflowJobStatusSensor` task is a templated value using the `task_instance.xcom_pull()` method, which then accesses the `dataflow_job_id` key within the `XCom` value:
```python
start_python_job_dataflow_runner_async = BeamRunPythonPipelineOperator(
    task_id="start_python_job_dataflow_runner_async",
    runner="DataflowRunner",
    py_file=GCS_PYTHON_DATAFLOW_ASYNC,
    pipeline_options={
        'tempLocation': GCS_TMP,
        'stagingLocation': GCS_STAGING,
        'output': GCS_OUTPUT,
    },
    py_options=[],
    py_requirements=['apache-beam[gcp]==2.26.0'],
    py_interpreter='python3',
    py_system_site_packages=False,
    dataflow_config=DataflowConfiguration(
        job_name='{{task.task_id}}',
        project_id=GCP_PROJECT_ID,
        location="us-central1",
        wait_until_finished=False,
    ),
)

wait_for_python_job_dataflow_runner_async_done = DataflowJobStatusSensor(
    task_id="wait-for-python-job-async-done",
    job_id="{{task_instance.xcom_pull('start_python_job_dataflow_runner_async')['dataflow_job_id']}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    project_id=GCP_PROJECT_ID,
    location='us-central1',
)
```
There is currently no equivalent way to directly access the `dataflow_job_id` value in the same manner using the `output` property.
Using `start_python_job_dataflow_runner_async.output["dataflow_job_id"]` simply yields the equivalent of `task_instance.xcom_pull(task_ids='start_python_job_dataflow_runner_async', key='dataflow_job_id')` (the indexing changes which `XCom` key is pulled rather than indexing into the pulled value). Even `start_python_job_dataflow_runner_async.output["return_value"]["dataflow_job_id"]` yields the same result: `task_instance.xcom_pull(task_ids='start_python_job_dataflow_runner_async', key='dataflow_job_id')`.
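For illustration only, the behavior being requested would allow something like the sketch below; this is hypothetical and not what the current `XComArg` implementation does:

```python
# Hypothetical (requested) behavior: index into the pushed return value
# directly via the XComArg instead of writing a Jinja template by hand.
wait_for_python_job_dataflow_runner_async_done = DataflowJobStatusSensor(
    task_id="wait-for-python-job-async-done",
    job_id=start_python_job_dataflow_runner_async.output["dataflow_job_id"],
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    project_id=GCP_PROJECT_ID,
    location='us-central1',
)
```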
It seems the only way to get the desired behavior currently is to hack around the `__str__` method that's available with `XComArg`:
```python
start_python_job_dataflow_runner_async_output = str(start_python_job_dataflow_runner_async.output).strip("{ }")

wait_for_python_job_dataflow_runner_async_done = DataflowJobStatusSensor(
    task_id="wait-for-python-job-async-done",
    job_id=f"{{{{ {start_python_job_dataflow_runner_async_output}['dataflow_job_id'] }}}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    project_id=GCP_PROJECT_ID,
    location='us-central1',
)
```
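For context on why this workaround functions at all: `str()` on an `XComArg` renders a complete `xcom_pull()` Jinja template, which the workaround strips and re-wraps. A rough sketch of the string transformations (the exact template `str(XComArg)` produces, e.g. whether a `dag_id` argument is included, may vary by Airflow version):

```python
# Approximate rendering of str(start_python_job_dataflow_runner_async.output):
rendered = "{{ task_instance.xcom_pull(task_ids='start_python_job_dataflow_runner_async', key='return_value') }}"

# .strip("{ }") removes the surrounding braces and spaces, leaving the bare call:
stripped = rendered.strip("{ }")
# -> "task_instance.xcom_pull(task_ids='start_python_job_dataflow_runner_async', key='return_value')"

# The f-string re-wraps the call in a Jinja template that indexes into the pulled dict:
job_id_template = f"{{{{ {stripped}['dataflow_job_id'] }}}}"
# -> "{{ task_instance.xcom_pull(task_ids='...', key='return_value')['dataflow_job_id'] }}"
```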
This approach is neither elegant, straightforward, nor user-friendly.
Use case / motivation
It's functionally intuitive for users to have direct access to the specific values in an `XCom` related to an `XComArg` via the TaskFlow API, just as with the classic `xcom_pull()` method. Ideally, using an operator's `.output` property and the `xcom_pull()` method would behave the same way when the actual values need to be passed between operators.
Are you willing to submit a PR? I would love to, but I would certainly need some guidance on the nuances here.
Related Issues: https://github.com/apache/airflow/issues/10285
Top GitHub Comments
Without a dependency, you might try to get a value out of XCom that hasn't been written yet!
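For context, a minimal sketch of the distinction being made here: passing `.output` (an `XComArg`) as an operator argument lets Airflow infer the upstream dependency, while a hand-written `xcom_pull()` Jinja string does not, so the ordering has to be declared explicitly:

```python
# With a raw Jinja string for job_id, Airflow cannot infer that the sensor
# depends on the Beam task, so declare the ordering explicitly; otherwise
# the sensor could run before the XCom value has been written.
start_python_job_dataflow_runner_async >> wait_for_python_job_dataflow_runner_async_done
```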
Sure thing. The screenshot below was grabbed from this talk: https://www.youtube.com/watch?v=D_1VJapCscc&t=1055s
There is also some info here: https://dagster.io/blog/dagster-airflow