Option to deserialize JSON from last log line in BashOperator and DockerOperator before sending to XCom
Description
In order to create an XCom value with a BashOperator or a DockerOperator, we can use the option do_xcom_push, which pushes the last line of the command logs to XCom.
It would be useful to provide an option xcom_json to deserialize this last log line when it is a JSON string, before sending it as an XCom. This would make it possible to access its attributes later in other tasks with the xcom_pull() method.
Use case/motivation
See my StackOverflow post: https://stackoverflow.com/questions/74083466/how-to-deserialize-xcom-strings-in-airflow
Consider a DAG containing two tasks (BashOperators or DockerOperators): Task A >> Task B. They need to communicate through XComs.
- Task A outputs its information as a one-line JSON string on stdout, which can then be retrieved from the logs of Task A, and so from its return_value XCom key if do_xcom_push=True. For instance: {"key1":1,"key2":3}
- Task B only needs the key2 information from Task A, so we need to deserialize the return_value XCom of Task A to extract only this value and pass it directly to Task B, using the Jinja template {{xcom_pull('task_a')['key2']}}. Using it as is results in jinja2.exceptions.UndefinedError: 'str object' has no attribute 'key2' because return_value is just a string.
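The failure can be reproduced outside Airflow. The snippet below is a minimal sketch in plain Python: xcom_value is a stand-in for the return_value XCom of Task A, and extract_key2 is a hypothetical helper that mirrors the ['key2'] lookup in the template. It only illustrates why the lookup fails on a string and succeeds once the string is parsed.

```python
import json

# Stand-in for the return_value XCom pushed by Task A:
# the last stdout line, stored as a plain string.
xcom_value = '{"key1":1,"key2":3}'


def extract_key2(value):
    """Hypothetical helper mirroring the ['key2'] lookup of the template."""
    return value["key2"]


try:
    extract_key2(xcom_value)
    failed = False
except TypeError:
    # A str cannot be indexed with a string key.
    failed = True

# Once the string is deserialized, the same lookup works.
parsed = json.loads(xcom_value)
key2 = extract_key2(parsed)
```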
For comparison, Airflow Variables can already be deserialized in Jinja templates (e.g. {{ var.json.my_var.path }}). I would like to be able to do the same thing with XComs.
Current workaround:
We can create a custom Operator (inherited from BashOperator or DockerOperator) and augment the execute method:
- execute the original execute method
- intercept the last log line of the task
- try to json.loads() it into a Python dictionary
- finally return the output (which is now a dictionary, not a string)
The previous Jinja template {{ xcom_pull('task_a')['key2'] }} now works in Task B, since the XCom value is now a Python dictionary.
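import json

from airflow.operators.bash import BashOperator
from airflow.providers.docker.operators.docker import DockerOperator


class BashOperatorExtended(BashOperator):
    def execute(self, context):
        # Run the original operator; it returns the last line of output.
        output = super().execute(context)
        try:
            # Deserialize the output if it is valid JSON.
            output = json.loads(output)
        except (TypeError, ValueError):
            # Not JSON (or no output): keep the raw value.
            pass
        return output


class DockerOperatorExtended(DockerOperator):
    def execute(self, context):
        output = super().execute(context)
        try:
            output = json.loads(output)
        except (TypeError, ValueError):
            pass
        return output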
But creating a new operator just for that purpose is not really satisfying…
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
If the goal is to make Jinja2 templating simpler (there’s no issue if it’s taskflow), the simplest way may be to add a built-in macro for this?
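Until such a built-in exists, something similar can probably be approximated per DAG with a custom Jinja filter. The sketch below is an assumption, not an existing Airflow macro: from_json is a hypothetical name, and the registration and template shown in the comments rely on the DAG argument user_defined_filters.

```python
import json


def from_json(value):
    """Hypothetical Jinja filter: parse a JSON string pulled from XCom."""
    return json.loads(value)


# Assumed registration when declaring the DAG (not executed here):
#   DAG(..., user_defined_filters={"from_json": from_json})
# Assumed template in Task B:
#   {{ (ti.xcom_pull(task_ids='task_a') | from_json)['key2'] }}

# What the template above would evaluate to for Task A's example output.
value_for_task_b = from_json('{"key1":1,"key2":3}')["key2"]
```

This keeps the stock operators untouched, at the cost of registering the filter in every DAG that needs it.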
Actually, now that I think of it, this could be made into a common “AbstractOperator” feature. We could add a “deserialize_output” parameter so that any operator can use it. I think we should even deserialize it using YAML, because then we would automatically handle both YAML and JSON (YAML is actually a 100% compatible superset of JSON: every proper JSON content is also valid YAML).
WDYT @uranusjr? I think having it as a common “operator” feature (disabled by default) would be quite powerful and could make a number of existing operators much easier to work with.
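A rough sketch of what such an opt-in step could look like, written as a standalone function rather than as actual Airflow code. Everything here is hypothetical: maybe_deserialize and its loader argument do not exist in Airflow, and only the deserialize_output name comes from the comment above. The default loader is json.loads; a YAML loader such as yaml.safe_load (from PyYAML) could be passed instead to follow the YAML suggestion.

```python
import json
from typing import Any, Callable


def maybe_deserialize(
    raw: Any,
    deserialize_output: bool = False,
    loader: Callable[[str], Any] = json.loads,
) -> Any:
    """Hypothetical post-processing of an operator's return value.

    Disabled by default, so existing behaviour is unchanged.
    """
    if not deserialize_output or not isinstance(raw, str):
        return raw
    try:
        return loader(raw)
    except Exception:
        # Deliberately broad: different loaders raise different error types.
        # Fall back to the raw string when it cannot be parsed.
        return raw


disabled = maybe_deserialize('{"key1":1,"key2":3}')
enabled = maybe_deserialize('{"key1":1,"key2":3}', deserialize_output=True)
not_json = maybe_deserialize("plain log line", deserialize_output=True)
```

Falling back to the raw string keeps the option safe to enable on tasks whose last log line is only sometimes structured.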