Create custom DAG classes with custom `clear` etc.
Description
I have a use case where a task creates a Spark cluster for all other tasks to use. This task is the root of all tasks in the DAG, so it runs first when the DAG is kicked off. If anything in the DAG fails, I need to clear both the failed task and the root task so that a fresh cluster is created for the rerun. This means I have to clear two tasks by hand every time.
I would love to be able to override the `clear` method in DAG (https://github.com/apache/airflow/blob/main/airflow/models/dag.py#L1821) in a custom DAG class so that it always also clears the root task. I tried doing this in a custom DAG subclass, but it didn't work. After some digging, the reason is the way DAGs are serialized for storage in the metadata DB: they lose the information about their original type, so overridden methods are never called.
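For concreteness, here is a minimal sketch of the kind of override I am after, assuming the Airflow 2.x `DAG.clear(task_ids=...)` signature; the `SparkClusterDAG` name is made up for illustration:

```python
from airflow.models.dag import DAG


class SparkClusterDAG(DAG):
    """Hypothetical DAG subclass: clearing any task also re-clears the
    root task that creates the Spark cluster."""

    def clear(self, task_ids=None, **kwargs):
        if task_ids:
            # self.roots holds the tasks with no upstream dependencies,
            # i.e. the cluster-creation task in this use case.
            task_ids = list(task_ids) + [t.task_id for t in self.roots]
        return super().clear(task_ids=task_ids, **kwargs)
```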
The type information is lost in this method: https://github.com/apache/airflow/blob/main/airflow/serialization/serialized_objects.py#L1000
Does anyone have an easier way to accomplish this?
One way to solve this would be to store the class information in the serialized object: add `serialize_dag["_class"] = dag.__class__.__module__ + '.' + dag.__class__.__name__` to the serialized representation, and update https://github.com/apache/airflow/blob/main/airflow/serialization/schema.json#L109 to allow a `_class` key.
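A sketch of what I mean on the serialization side (the exact placement inside the linked serialization code is an assumption):

```python
# In airflow/serialization/serialized_objects.py, where the serialized
# dict for the DAG is built (sketch):
serialize_dag["_class"] = (
    dag.__class__.__module__ + "." + dag.__class__.__name__
)
# schema.json would need a matching string-typed "_class" property so
# that validation of the serialized DAG still passes.
```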
Then the issue comes down to deserializing the DAG, which is where I'm struggling.
We can run

```python
from pydoc import locate

dag_class = locate(encoded_dag["_class"])
dag_class(dag.__dict__)
```

here: https://github.com/apache/airflow/blob/main/airflow/serialization/serialized_objects.py#L1120. But I'm getting tripped up on how to define the attributes we pass to the `dag_class` constructor. Any help is appreciated 🙏
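One possible way around the constructor problem, sketched under the assumption that the normal deserialization path already populates every attribute (the helper name `deserialize_with_class` is made up for illustration): skip the constructor entirely and rebind `__class__` after the fact.

```python
from pydoc import locate

from airflow.models.dag import DAG
from airflow.serialization.serialized_objects import SerializedDAG


def deserialize_with_class(encoded_dag: dict) -> DAG:
    # Let the existing code build the DAG object as usual.
    dag = SerializedDAG.deserialize_dag(encoded_dag)
    # Then rebind the class recorded at serialization time, so that
    # overridden methods such as clear() are picked up without having
    # to guess the subclass constructor's signature.
    path = encoded_dag.get("_class")
    custom_class = locate(path) if path else None
    if custom_class is not None and issubclass(custom_class, DAG):
        dag.__class__ = custom_class
    return dag
```

Rebinding `__class__` is admittedly blunt (it discards any `SerializedDAG`-specific behaviour, for one), but it sidesteps the question of what to pass to the constructor.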
Use case/motivation
I would like to be able to override methods within the DAG class to customize the behaviour of my custom DAG classes which inherit from `DAG`.
Related issues
I didn't find any 😦
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
The actual callbacks are not set on the deserialized/recreated DAG because that is impossible to do without importing the original DAG files. Airflow only imports the DAG files in worker processes, so user code is never run in the scheduler (which only ever uses deserialized DAGs). Those callbacks are instead executed by the DAG processor process, which imports the DAG files and thus has access to the actual Python function objects.
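A quick way to see this, assuming a recent Airflow 2.x release, is to round-trip a DAG through the serializer and inspect what survives:

```python
from airflow.models.dag import DAG
from airflow.serialization.serialized_objects import SerializedDAG


def notify(context):
    print("DAG run failed")


dag = DAG(dag_id="demo", on_failure_callback=notify)
roundtripped = SerializedDAG.from_dict(SerializedDAG.to_dict(dag))

print(type(roundtripped).__name__)           # SerializedDAG, not the original class
print(roundtripped.has_on_failure_callback)  # True: only a boolean flag survives
```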
Also to add: a very important part of this is security. We are working on improving the security of Airflow, and we are gradually fixing all the potential places where, in particular, a DAG author can provide code that could "escape" the sandboxes we provide. https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation introduced the possibility of separating the code provided by DAG authors from the scheduler. As of this change, the code provided by DAG authors can only run in a separate process, which can run on a separate machine. In the future it will be further isolated (so that code execution from different teams happens in separate sandboxes, processes, or even machines); this is how we want to introduce multi-tenancy.