Create custom DAG classes with custom `clear` etc.
Description
I have a use case where a task creates a Spark cluster for all other tasks to use. This task is the root of all tasks in the DAG, so it runs first when the DAG is kicked off. If anything in the DAG fails, I need to clear both the failed task and the root task so that a fresh cluster is created for the rerun. This means I have to clear two tasks by hand every time.
I would love to be able to override the `clear` method in DAG (https://github.com/apache/airflow/blob/main/airflow/models/dag.py#L1821) in a custom DAG class so that it always also clears the root task. I tried doing this in a custom DAG subclass, but it didn't work. After some digging, the reason is the way DAGs are serialized for storage in the metadata DB: they lose the information about their original type, so overridden methods are never called.
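For concreteness, here is a minimal sketch of the kind of override I am after, assuming the Airflow 2.x `DAG.clear(task_ids=...)` signature; the `SparkClusterDAG` name is made up for illustration:

```python
from airflow.models.dag import DAG


class SparkClusterDAG(DAG):
    """Hypothetical DAG subclass: clearing any task also re-clears the
    root task that creates the Spark cluster."""

    def clear(self, task_ids=None, **kwargs):
        if task_ids:
            # self.roots holds the tasks with no upstream dependencies,
            # i.e. the cluster-creation task in this use case.
            task_ids = list(task_ids) + [t.task_id for t in self.roots]
        return super().clear(task_ids=task_ids, **kwargs)
```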
The type information is lost in this method: https://github.com/apache/airflow/blob/main/airflow/serialization/serialized_objects.py#L1000
Does anyone have an easier way to accomplish this?
One way to solve this would be to store the class information in the serialized object: add `serialize_dag["_class"] = dag.__class__.__module__ + '.' + dag.__class__.__name__` to the serialized representation, and update https://github.com/apache/airflow/blob/main/airflow/serialization/schema.json#L109 to allow a `_class` key.
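A sketch of what I mean on the serialization side (the exact placement inside the linked serialization code is an assumption):

```python
# In airflow/serialization/serialized_objects.py, where the serialized
# dict for the DAG is built (sketch):
serialize_dag["_class"] = (
    dag.__class__.__module__ + "." + dag.__class__.__name__
)
# schema.json would need a matching string-typed "_class" property so
# that validation of the serialized DAG still passes.
```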
Then the issue comes down to deserializing the DAG, which is where I'm struggling.
We can run

```python
from pydoc import locate

dag_class = locate(encoded_dag["_class"])
dag_class(dag.__dict__)
```

here: https://github.com/apache/airflow/blob/main/airflow/serialization/serialized_objects.py#L1120. But I'm getting tripped up on how to define the attributes we pass to the `dag_class` constructor. Any help is appreciated 🙏
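One possible way around the constructor problem, sketched under the assumption that the normal deserialization path already populates every attribute (the helper name `deserialize_with_class` is made up for illustration): skip the constructor entirely and rebind `__class__` after the fact.

```python
from pydoc import locate

from airflow.models.dag import DAG
from airflow.serialization.serialized_objects import SerializedDAG


def deserialize_with_class(encoded_dag: dict) -> DAG:
    # Let the existing code build the DAG object as usual.
    dag = SerializedDAG.deserialize_dag(encoded_dag)
    # Then rebind the class recorded at serialization time, so that
    # overridden methods such as clear() are picked up without having
    # to guess the subclass constructor's signature.
    path = encoded_dag.get("_class")
    custom_class = locate(path) if path else None
    if custom_class is not None and issubclass(custom_class, DAG):
        dag.__class__ = custom_class
    return dag
```

Rebinding `__class__` is admittedly blunt (it discards any `SerializedDAG`-specific behaviour, for one), but it sidesteps the question of what to pass to the constructor.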
Use case/motivation
I would like to be able to override methods within the DAG class to customize the behaviour of my custom DAG classes which inherit from `DAG`.
Related issues
I didn't find any 😦
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
The actual callbacks are not set on the deserialized/recreated DAG because that is impossible to do without importing the original DAG files. Airflow only imports the DAG files in worker processes, so user code is never run in the scheduler (which only ever uses deserialized DAGs). Those callbacks are instead executed by the DAG processor process, which imports the DAG files and thus has access to the actual Python function objects.
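A quick way to see this, assuming a recent Airflow 2.x release, is to round-trip a DAG through the serializer and inspect what survives:

```python
from airflow.models.dag import DAG
from airflow.serialization.serialized_objects import SerializedDAG


def notify(context):
    print("DAG run failed")


dag = DAG(dag_id="demo", on_failure_callback=notify)
roundtripped = SerializedDAG.from_dict(SerializedDAG.to_dict(dag))

print(type(roundtripped).__name__)           # SerializedDAG, not the original class
print(roundtripped.has_on_failure_callback)  # True: only a boolean flag survives
```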
Also to add: a very important part of this is security. We are working on improving the security of Airflow, and we are gradually fixing all the potential places where, in particular, a DAG author can provide code that could "escape" the sandboxes we provide. https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation introduced the possibility of separating the code provided by DAG authors from the scheduler. As of this change, the code provided by DAG authors can only run in a separate process, which can run on a separate machine. In the future it will be further isolated (so that code execution from different teams happens in separate sandboxes, processes, or even machines); this is how we want to introduce multi-tenancy.