DAGs defined as classes get a bad fileloc attribute and display the wrong code in the UI after upgrading to 2.4.x

Apache Airflow version

2.4.1

What happened

Hi everyone,

I am currently on Airflow 2.3.0 and in the process of migrating to 2.4.1, and I am having an issue with the parsing of DAGs that affects the UI.

To reuse code, we encapsulate DAGs in Python classes. Sometimes one DAG class inherits from another to modify some behavior while preserving the original shape of the DAG, as shown in the example below:

# airflow/app/dags/dummyA/dag.py
from datetime import datetime
from airflow.decorators import dag, task

class BaseDag:
    START_DATE = datetime(2022, 1, 1)

    def __init__(self, message: str):
        self.message = message

    def dag_wrapper(self, dag_id: str):
        @dag(dag_id=dag_id, start_date=self.START_DATE, catchup=False)
        def _base_dag():

            @task(task_id="print_message")
            def print_message(message: str):
                print(message)

            print_message(self.message)

        return _base_dag()

BaseDag("my message").dag_wrapper("BaseDag")

# airflow/app/dags/dummyB/dag.py
from app.dags.dummyA.dag import BaseDag

class ChildDag(BaseDag):

    def __init__(self, message: str):
        self.message = f"custom {message}"

ChildDag("my message").dag_wrapper("ChildDag")

We use an extremely basic configuration of Airflow with a containerized Postgres database, a container for the webserver and one for the scheduler (which uses the LocalExecutor).

During airflow db init, I get the following error:

ERROR [airflow.models.dagbag.DagBag] Exception bagging dag: BaseDag
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 484, in _bag_dag
    raise AirflowDagDuplicatedIdException(
airflow.exceptions.AirflowDagDuplicatedIdException: Ignoring DAG BaseDag from /usr/local/airflow/app/dags/dummyB/dag.py - also found in /usr/local/airflow/app/dags/dummyA/dag.py
ERROR [airflow.models.dagbag.DagBag] Failed to bag_dag: /usr/local/airflow/app/dags/dummyB/dag.py
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 425, in _process_modules
    self.bag_dag(dag=dag, root_dag=dag)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 452, in bag_dag
    self._bag_dag(dag=dag, root_dag=root_dag, recursive=True)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/dagbag.py", line 484, in _bag_dag
    raise AirflowDagDuplicatedIdException(
airflow.exceptions.AirflowDagDuplicatedIdException: Ignoring DAG BaseDag from /usr/local/airflow/app/dags/dummyB/dag.py - also found in /usr/local/airflow/app/dags/dummyA/dag.py

This error does not affect the functioning of Airflow or my DAGs, but when I open the UI and look at the code of BaseDag, the code of ChildDag is displayed instead (the code shown for ChildDag itself is correct): (image)

This behavior is confirmed by checking the database (dag table): (image)

If I run the DAG, the correct code is executed and the correct code is then displayed, but if I reload the UI, the code of the child class is displayed again.

What is surprising is that when I print the fileloc attribute of these two DAGs, both show the file path of BaseDag.

(If I skip airflow db init, I observe the same behavior in the UI.)

What you think should happen instead

The BaseDag code should be displayed (instead of the ChildDag one).

How to reproduce

Run an Airflow 2.4.1 instance with these two DAGs and you should see the wrong code in the UI (to display the error logs, you can just run airflow db init).

Operating System

Docker image apache/airflow:2.4.1-python3.8 (Debian GNU/Linux 11 (bullseye))

Versions of Apache Airflow Providers

apache-airflow-providers-common-sql==1.2.0
apache-airflow-providers-docker==3.2.0
apache-airflow-providers-odbc==3.1.2
apache-airflow-providers-postgres==5.2.2

Deployment

Docker-Compose

Deployment details

Image used: apache/airflow:2.4.1-python3.8 (Python 3.8)

  • a Postgres container (postgres:14.4)
  • an airflow init container (airflow db init; airflow db upgrade; airflow users create)
  • a scheduler (LocalExecutor)
  • a webserver

This is a simplified version of the official docker-compose.

Anything else

This problem occurs every time. It first appeared when I upgraded from Airflow 2.3.4 to 2.4.1; no other libraries were changed.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

2 reactions
tirkarthi commented, Oct 19, 2022

Since Airflow 2.4.0 has the auto-register feature enabled by default, the top-level BaseDag is registered once dummyA is loaded. When dummyB is loaded, dummyB/dag.py imports BaseDag from dummyA, which triggers auto-registration of the BaseDag DAG again. During auto-registration the module is set to dummyB through the current_autoregister_module_name attribute, and that module is then assigned to the BaseDag DAG, which leads to the incorrect fileloc and source code. See https://github.com/apache/airflow/blob/a74d523ae1c6152ff4335e9c63ff418a6ae529c4/airflow/models/dagbag.py#L330
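The import side effect described in this comment can be reproduced without Airflow. The sketch below is a toy model, not Airflow's implementation: toy_registry, parse_file and the file names dummy_a.py and dummy_b.py are hypothetical stand-ins, and it assumes the parser loads each DAG file under its own module name, so a later plain import of the same file executes its top-level code a second time.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Hypothetical registry that remembers which file was being parsed
# when each "DAG" was created. This only mimics the idea of auto-registration.
REGISTRY_SRC = """
current_file = None   # file the toy parser is currently loading
created = []          # (dag_id, current_file) pairs, in creation order

def register(dag_id):
    created.append((dag_id, current_file))
"""

DUMMY_A_SRC = """
import toy_registry

class BaseDag:
    def dag_wrapper(self, dag_id):
        toy_registry.register(dag_id)

BaseDag().dag_wrapper("BaseDag")   # top-level instantiation
"""

DUMMY_B_SRC = """
from dummy_a import BaseDag        # runs dummy_a's top-level code again

class ChildDag(BaseDag):
    pass

ChildDag().dag_wrapper("ChildDag")
"""


def parse_file(path, registry):
    """Load one file under a unique module name (assumed parser behavior)."""
    registry.current_file = path.name
    spec = importlib.util.spec_from_file_location(f"parsed_{path.stem}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "toy_registry.py").write_text(REGISTRY_SRC)
        (root / "dummy_a.py").write_text(DUMMY_A_SRC)
        (root / "dummy_b.py").write_text(DUMMY_B_SRC)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            import toy_registry
            parse_file(root / "dummy_a.py", toy_registry)
            parse_file(root / "dummy_b.py", toy_registry)
            return list(toy_registry.created)
        finally:
            # Undo the path and module changes made for the demo.
            sys.path.remove(tmp)
            for name in ("toy_registry", "dummy_a"):
                sys.modules.pop(name, None)


result = run_demo()
# "BaseDag" is created twice; the second time it is attributed to dummy_b.py.
print(result)
```

In this toy model the second BaseDag entry is attributed to dummy_b.py, which mirrors the duplicated DAG id and the wrong file location reported in the issue.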

A couple of workarounds I can see until this case is fixed:

  1. Move the instantiation of BaseDag out of dummyA/dag.py into a separate DAG file, so that imports from other files don’t trigger this case. This also helps with better organization.
  2. Set auto_register to False in the dag wrapper so that this is not triggered, and assign the instantiated DAGs to variables at the top level so that they continue to be picked up.

https://airflow.apache.org/blog/airflow-2.4.0/#auto-register-dags-used-in-a-context-manager-no-more-as-dag-needed
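The second workaround relies on DAG objects being discoverable through module-level variables once auto-registration is switched off. The sketch below illustrates that idea only in a toy form: ToyDag and collect_top_level are hypothetical names, and the scan over the module namespace is an assumption about how such discovery could work, not a copy of Airflow's code.

```python
import types


class ToyDag:
    """Hypothetical stand-in for a DAG object with auto-registration disabled."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


def collect_top_level(module):
    """Return the ids of ToyDag objects bound to module-level names."""
    return sorted(
        obj.dag_id for obj in vars(module).values() if isinstance(obj, ToyDag)
    )


# Source of a toy "DAG file": one object is discarded, one is kept in a variable.
DAG_FILE_SRC = """
ToyDag("discarded")       # result thrown away: nothing refers to it afterwards
kept = ToyDag("kept")     # bound to a top-level name, so a scan can find it
"""

module = types.ModuleType("toy_dag_file")
module.ToyDag = ToyDag
exec(DAG_FILE_SRC, module.__dict__)

found = collect_top_level(module)
print(found)
```

Applied to the issue's example, this would mean writing something like base_dag = BaseDag("my message").dag_wrapper("BaseDag") at the top level of the DAG file, with auto_register set to False in the dag wrapper as the comment suggests.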

1 reaction
MkSafavi commented, Oct 7, 2022

I’m facing the same issue in 2.4.1 in almost all of my DAGs. I have a few task_groups that are shared between DAGs. Importing them into another DAG breaks the fileloc and shows the wrong .py file in the UI, so I don’t think it’s related to your DAG instance being inside a class. I found out about this issue after one of our servers failed to retry any DAG run (though I’m not sure if it’s related).
