
Create custom DAG classes with custom `clear` etc.

See original GitHub issue

Description

I have a use case where a task creates a Spark cluster for all other tasks to use. This task is the root of the DAG, so it runs first when the DAG is kicked off. If anything in the DAG fails, I need to clear both the failed task and the root task so that a fresh, reusable cluster is created. This means I have to clear two tasks every time.
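The override I have in mind looks roughly like this (a sketch against a minimal stand-in for `airflow.models.DAG`, since the real `clear` touches the metadata DB; `SparkClusterDAG` and the root-task id are my own names, not Airflow API):

```python
class DAG:
    """Minimal stand-in for airflow.models.dag.DAG, just to show the shape."""

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.cleared = []

    def clear(self, task_ids=None, **kwargs):
        # In real Airflow this resets task instances; here we just record them.
        self.cleared.extend(task_ids or [])
        return len(task_ids or [])


class SparkClusterDAG(DAG):
    """Hypothetical subclass: always clear the cluster-creation root task too,
    so the cleared task gets a fresh, reusable Spark cluster."""

    ROOT_TASK_ID = "create_spark_cluster"  # assumed id of the root task

    def clear(self, task_ids=None, **kwargs):
        task_ids = list(task_ids or [])
        if self.ROOT_TASK_ID not in task_ids:
            task_ids.append(self.ROOT_TASK_ID)
        return super().clear(task_ids=task_ids, **kwargs)
```

With this, clearing one failed task would also clear the cluster-creation task in a single call.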

I would love to be able to override the `clear` method of `DAG` (https://github.com/apache/airflow/blob/main/airflow/models/dag.py#L1821) in a custom DAG class so that it always clears the root task as well. I tried doing this in a custom subclass, but it didn't work. After some digging, the reason is the way DAGs are serialized for storage in the metadata DB: they lose information about their original type, so overridden methods are never called. This happens in this method: https://github.com/apache/airflow/blob/main/airflow/serialization/serialized_objects.py#L1000

Does anyone have an easier way of accomplishing this?

One way to solve this would be to store the class information in the serialized object: add `serialize_dag["_class"] = dag.__class__.__module__ + '.' + dag.__class__.__name__` during serialization, and update https://github.com/apache/airflow/blob/main/airflow/serialization/schema.json#L109 to include `_class`. The issue then comes down to deserializing the DAG, which I'm struggling with. We can run

```python
from pydoc import locate

dag_class = locate(encoded_dag["_class"])
dag_class(dag.__dict__)
```

here: https://github.com/apache/airflow/blob/main/airflow/serialization/serialized_objects.py#L1120, but I'm getting tripped up on how to define the attributes we pass to the `dag_class` constructor. Any help is appreciated 🙏
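One alternative I'm considering (a sketch, using stand-in classes rather than real Airflow ones; `_class` is the hypothetical serialized field described above): instead of calling the subclass constructor, swap the already-deserialized instance's `__class__`, so overridden methods resolve to the subclass without re-running `__init__` at all:

```python
from pydoc import locate


class DAG:
    """Minimal stand-in for airflow.models.dag.DAG."""

    def clear(self):
        return "base clear"


class SparkDAG(DAG):
    """Hypothetical user subclass with an overridden clear."""

    def clear(self):
        return "clear + root task"


def restore_dag_class(dag, encoded_dag):
    """Re-attach the original subclass to an already-deserialized DAG.

    encoded_dag["_class"] is the hypothetical dotted-path field added at
    serialization time. Swapping __class__ avoids calling the subclass
    constructor, sidestepping the constructor-arguments problem.
    """
    dotted_path = encoded_dag.get("_class")
    cls = locate(dotted_path) if dotted_path else None
    if cls is not None and issubclass(cls, DAG):
        dag.__class__ = cls
    return dag
```

The caveat is that swapping `__class__` only works if the subclass needs no extra constructor-initialized state; anything it relies on must already exist on the deserialized instance.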

Use case/motivation

I would like to be able to override methods of the DAG class to customize the behaviour of my custom DAG classes that inherit from DAG.

Related issues

I didn't find any 😦

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

2 reactions
uranusjr commented, Jul 21, 2022

I can't figure out where we deserialize or recreate the callback methods for on_success/failure

The actual callbacks are not set on the deserialized/recreated DAG because it's impossible to do so without importing the original DAG files. Airflow only imports DAG files in worker processes, so user code is never run in the scheduler (which only ever uses deserialized DAGs). Those callbacks are instead executed by the DAG processor process, which imports the DAG files and thus has access to the actual Python function objects.
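An illustrative sketch of that split (not Airflow's actual code; `serialize_dag` and `run_failure_callback` are made-up names): the function object can't survive JSON serialization, so only its presence is recorded, and the process that runs the callback must have imported the DAG file itself:

```python
def serialize_dag(dag: dict) -> dict:
    """Drop the function object; keep only a flag that a callback exists."""
    return {
        "dag_id": dag["dag_id"],
        "has_on_failure_callback": callable(dag.get("on_failure_callback")),
    }


def run_failure_callback(serialized: dict, dag_module: dict, context: dict):
    """The DAG-processor side: it imported the DAG file, so the real function
    is available (modeled here as a plain namespace dict)."""
    if serialized["has_on_failure_callback"]:
        return dag_module["on_failure_callback"](context)
    return None
```

The scheduler only ever sees the `serialized` dict; the callable lives solely in processes that import the DAG file.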

1 reaction
potiuk commented, Jul 21, 2022

Also, to add: a very important part of this is security. We are working on improving the security of Airflow and gradually fixing all the places where, especially, a DAG author can provide code that could potentially "escape" the sandboxes we provide. https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-43+DAG+Processor+separation introduced the possibility of separating the code provided by DAG authors from the scheduler. As of this change, code provided by DAG authors can only run in a separate process, which can run on a separate machine. In the future it will be isolated further (so that code execution for different teams happens in separate sandboxes, processes, or even machines); this is how we want to introduce multi-tenancy.
