DAGs as classes cause a bad fileloc attribute and display the wrong code in the UI after upgrade to 2.4.x
Apache Airflow version
2.4.1
What happened
Hi everyone,
I am currently on Airflow 2.3.0 and in the process of migrating to 2.4.1, and I am having an issue with the parsing of DAGs (which also affects the UI).
In order to reuse code, we encapsulate DAGs in Python classes. Sometimes a DAG inherits from another one to modify some behavior while preserving the original shape of the DAG, as shown in the example below:
# airflow/app/dags/dummyA/dag.py
from datetime import datetime

from airflow.decorators import dag, task


class BaseDag:
    START_DATE = datetime(2022, 1, 1)

    def __init__(self, message: str):
        self.message = message

    def dag_wrapper(self, dag_id: str):
        @dag(dag_id=dag_id, start_date=self.START_DATE, catchup=False)
        def _base_dag():
            @task(task_id="print_message")
            def print_message(message: str):
                print(message)

            print_message(self.message)

        return _base_dag()


BaseDag("my message").dag_wrapper("BaseDag")

# airflow/app/dags/dummyB/dag.py
from app.dags.dummyA.dag import BaseDag


class ChildDag(BaseDag):
    def __init__(self, message: str):
        self.message = f"custom {message}"


ChildDag("my message").dag_wrapper("ChildDag")
We use an extremely basic configuration of Airflow with a containerized Postgres database, a container for the webserver and one for the scheduler (which uses the LocalExecutor).
During airflow db init, I get the following error:
ERROR [airflow.models.dagbag.DagBag] Exception bagging dag: BaseDag
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 484, in _bag_dag
    raise AirflowDagDuplicatedIdException(
airflow.exceptions.AirflowDagDuplicatedIdException: Ignoring DAG BaseDag from /usr/local/airflow/app/dags/dummyB/dag.py - also found in /usr/local/airflow/app/dags/dummyA/dag.py
ERROR [airflow.models.dagbag.DagBag] Failed to bag_dag: /usr/local/airflow/app/dags/dummyB/dag.py
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 425, in _process_modules
    self.bag_dag(dag=dag, root_dag=dag)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 452, in bag_dag
    self._bag_dag(dag=dag, root_dag=root_dag, recursive=True)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 484, in _bag_dag
    raise AirflowDagDuplicatedIdException(
airflow.exceptions.AirflowDagDuplicatedIdException: Ignoring DAG BaseDag from /usr/local/airflow/app/dags/dummyB/dag.py - also found in /usr/local/airflow/app/dags/dummyA/dag.py
This error does not affect how Airflow or my DAGs run, but when I open the UI and look at the code of BaseDag, the code of the child class is displayed (the code shown for ChildDag is correct).
Checking the database (dag table) confirms this behavior.
If I run the DAG, the correct code is executed and the correct code is then displayed, but if I reload the UI, the code of the child class is displayed again.
What is surprising is that when I display the fileloc attribute of these two DAGs, it is the file path of the BaseDag that is displayed.
(If I don’t run airflow db init, I observe the same behavior in the UI.)
What you think should happen instead
The BaseDag code should be displayed (instead of the ChildDag one).
How to reproduce
Run an Airflow 2.4.1 instance with these two DAGs and you should see the wrong code in the UI (to display the error logs, you can just run airflow db init).
Operating System
Docker image apache/airflow:2.4.1-python3.8 (Debian GNU/Linux 11 (bullseye))
Versions of Apache Airflow Providers
apache-airflow-providers-common-sql==1.2.0
apache-airflow-providers-docker==3.2.0
apache-airflow-providers-odbc==3.1.2
apache-airflow-providers-postgres==5.2.2
Deployment
Docker-Compose
Deployment details
- used image: apache/airflow:2.4.1-python3.8 (Python 3.8)
- a Postgres container (postgres:14.4)
- an airflow init container (airflow db init; airflow db upgrade; airflow users create)
- a scheduler (LocalExecutor)
- a webserver
(This is a simplified version of the official docker-compose.)
Anything else
This problem occurs every time and it happened when I upgraded from airflow 2.3.4 to 2.4.1, no other libraries were changed.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
Since Airflow 2.4.0 has the auto-register feature enabled by default, the top-level BaseDag is registered once dummyA is loaded. Once dummyB is loaded, dummyB/dag.py tries to import BaseDag from dummyA, which gets auto-registered. During auto-registration the mod is set to dummyB using the current_autoregister_module_name attribute, which is again assigned to BaseDag; this leads to the incorrect fileloc and source code. https://github.com/apache/airflow/blob/a74d523ae1c6152ff4335e9c63ff418a6ae529c4/airflow/models/dagbag.py#L330
Couple of workarounds I can see before fixing this case:
- Move BaseDag from dummyA/dag.py to a separate dag file so that imports from other files don’t trigger this kind of case; this also helps with better organization.
- Set auto_register to False in the dag wrapper so that this is not triggered, and assign the initialized dags to variables at the top level so that they continue to work. https://airflow.apache.org/blog/airflow-2.4.0/#auto-register-dags-used-in-a-context-manager-no-more-as-dag-needed
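To make the explanation above concrete, here is a toy model in plain Python. It is only a sketch of the described behavior, not Airflow's actual code: ToyDag, ToyParser and the dummy_a/dummy_b functions are hypothetical names, and it assumes that importing dummyA from dummyB re-runs dummyA's top-level code while dummyB is the file being parsed. The "fixed" variant is one reading of the first workaround, where the shared module only defines the class and creates no DAG at import time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ToyDag:
    """Hypothetical stand-in for a DAG object (not Airflow's DAG class)."""
    dag_id: str
    fileloc: Optional[str] = None


class ToyParser:
    """Toy per-file parser with auto-registration (assumed behavior)."""

    def __init__(self) -> None:
        self.current_file: Optional[str] = None  # file being parsed right now
        self.db: Dict[str, Optional[str]] = {}   # dag_id -> fileloc, like a "dag" table

    def create_dag(self, dag_id: str) -> ToyDag:
        # Auto-registration: a DAG created while a file is being parsed is
        # attributed to that file, whichever module's code created it.
        dag = ToyDag(dag_id, fileloc=self.current_file)
        self.db[dag_id] = dag.fileloc  # the last parsed file wins
        return dag

    def parse(self, path: str, module_body: Callable[["ToyParser"], None]) -> None:
        self.current_file = path
        module_body(self)  # run the file's top-level code


def dummy_a(parser: ToyParser) -> None:
    # Top-level code of dummyA/dag.py: instantiates BaseDag.
    parser.create_dag("BaseDag")


def dummy_b(parser: ToyParser) -> None:
    # Assumption: importing BaseDag re-runs dummyA's top-level code here.
    dummy_a(parser)
    parser.create_dag("ChildDag")


def shared_base(parser: ToyParser) -> None:
    # Workaround sketch: the shared module defines the class, creates no DAG.
    pass


def fixed_a(parser: ToyParser) -> None:
    shared_base(parser)
    parser.create_dag("BaseDag")


def fixed_b(parser: ToyParser) -> None:
    shared_base(parser)
    parser.create_dag("ChildDag")


def simulate(file_a, file_b) -> Dict[str, Optional[str]]:
    parser = ToyParser()
    parser.parse("dummyA/dag.py", file_a)
    parser.parse("dummyB/dag.py", file_b)
    return parser.db


buggy = simulate(dummy_a, dummy_b)
fixed = simulate(fixed_a, fixed_b)
print(buggy)
print(fixed)
```

In this toy model the first run records BaseDag under dummyB/dag.py, which matches the wrong fileloc reported in the issue, while the second run keeps each DAG attributed to its own file.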
I’m facing the same issue in 2.4.1 in almost all of my dags. I have a few task_groups that are shared between dags. Importing them into another dag breaks the fileloc and shows the wrong .py file in UI. So I don’t think it’s related to your dag instance being inside a class. I found out about this issue after one of our servers failed to retry any dag run. (though I’m not sure if it’s related)