Scheduler Loading DAGs That Have Not Changed
Apache Airflow version
2.2.3 (latest released)
What happened
I have a few DAGs in my DAG folder. I used git-sync to copy them into the DAG folder.
I looked at the DAGs inside my DAG folder and saw that they were last changed on Jan 24 (I used the ls -l /opt/airflow/dags/repo/ command to check that).
Example for one DAG that I have in my dag folder:
-rw-r--r-- 1 65533 root 4141 Jan 24 19:30 clear_missing_dags.py
When I opened the scheduler logs under /opt/airflow/logs/scheduler/latest/{my_dag_file}.log
and the logs in /opt/airflow/logs/dag_processor_manager/dag_processor_manager.log,
I saw that the scheduler loaded the DAGs on Jan 25 even though they had not changed.
Example from the scheduler logs:
[2022-01-25 09:46:18,615] {processor.py:654} INFO - DAG(s) dict_keys(['clear_missing_dags']) retrieved from /opt/airflow/dags/repo/clear_missing_dags.py
[2022-01-25 09:46:18,633] {logging_mixin.py:109} INFO - [2022-01-25 09:46:18,633] {dag.py:2396} INFO - Sync 1 DAGs
[2022-01-25 09:46:18,655] {logging_mixin.py:109} INFO - [2022-01-25 09:46:18,655] {dag.py:2935} INFO - Setting next_dagrun for clear_missing_dags to None
[2022-01-25 09:46:18,676] {processor.py:171} INFO - Processing /opt/airflow/dags/repo/clear_missing_dags.py took 0.186 seconds
Example from the DAG processor manager logs:
DAG File Processing Stats
File Path                                     PID    Runtime      # DAGs    # Errors  Last Runtime    Last Run
--------------------------------------------  -----  ---------  --------  ----------  --------------  -------------------
/opt/airflow/dags/repo/bash_example.py                                 0           1  0.15s           2022-01-25T09:48:20
/opt/airflow/dags/repo/branch_datetime.py                              0           1  0.15s           2022-01-25T09:48:26
/opt/airflow/dags/repo/python_example.py                               1           0  0.20s           2022-01-25T09:48:33
/opt/airflow/dags/repo/clear_missing_dags.py                           1           0  0.17s           2022-01-25T09:48:20
================================================================================
[2022-01-25 09:48:48,730] {manager.py:1065} INFO - Finding 'running' jobs without a recent heartbeat
[2022-01-25 09:48:48,731] {manager.py:1069} INFO - Failing jobs without heartbeat after 2022-01-25 09:43:48.731074+00:00
As far as I know, the scheduler checks whether a DAG has changed (by checking whether the file's modification date has changed since the DAG was last loaded). It seems like this is not working.
What you expected to happen
I expected the scheduler not to load the DAG again until we change it.
How to reproduce
This happens on the default Helm chart deployment (I used helm install airflow .).
You can reproduce it by deploying the chart and creating a DAG file inside the DAG folder.
Operating System
Debian GNU/Linux 10 (buster)
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
I used the default values from the Helm chart and only configured the git-sync option.
Anything else
This problem happens each time the DAGs are loaded. It causes the scheduler to run the cluster policies every X seconds instead of running them only when a DAG has changed.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
Issue Analytics
- Created 2 years ago
- Comments:6 (3 by maintainers)
This is as intended. All DAGs are parsed continuously, no matter whether they changed or not, simply because re-parsing a DAG at different times can generate a different DAG. For example, if the DAG file reads an external file and creates the DAG structure based on it, parsing might produce a different DAG when the external file changes. The same goes for importing external libraries.
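To make the point about external inputs concrete, here is a minimal sketch in plain Python that does not use Airflow at all. The function build_task_ids and the file tables.json are hypothetical stand-ins for a DAG file body and the external file it reads. The code itself never changes between the two calls, yet the "parse" result does, which is why a modification-time check on the DAG file alone would miss the change.

```python
import json
import os
import tempfile


def build_task_ids(config_path):
    """Stand-in for a DAG file body: derive task ids from an external file."""
    with open(config_path) as fh:
        tables = json.load(fh)["tables"]
    return [f"load_{name}" for name in tables]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical external file that the "DAG file" depends on.
    config_path = os.path.join(tmp, "tables.json")

    with open(config_path, "w") as fh:
        json.dump({"tables": ["users"]}, fh)
    first_parse = build_task_ids(config_path)

    # Only the external file changes; the code above is untouched.
    with open(config_path, "w") as fh:
        json.dump({"tables": ["users", "orders"]}, fh)
    second_parse = build_task_ids(config_path)

print(first_parse)
print(second_parse)
```

The two printed lists differ even though the function definition is identical in both runs.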
Time of last modification of the DAG only matters for scheduling priority, but each DAG will be re-parsed every min_file_process_interval seconds. This is how Airflow works currently.

You can set min_file_process_interval to a large value (for example 86400), file_parsing_sort_mode to "modified_time" and default_timezone to "system".

I think we have to change the min_file_process_interval description due to https://github.com/apache/airflow/commit/add7490145fabd097d605d85a662dccd02b600de
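As a rough sketch of how the settings suggested in the comments could be passed to a containerized deployment, the snippet below builds environment variable names using Airflow's AIRFLOW__{SECTION}__{KEY} convention. The section names are an assumption not stated in the thread: it places min_file_process_interval and file_parsing_sort_mode under [scheduler] and default_timezone under [core], so check them against the configuration reference for your version. The helper airflow_env_var is hypothetical.

```python
def airflow_env_var(section, key):
    """Build an env var name following the AIRFLOW__{SECTION}__{KEY} convention."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Values taken from the comment above; section names are assumptions.
settings = {
    ("scheduler", "min_file_process_interval"): "86400",
    ("scheduler", "file_parsing_sort_mode"): "modified_time",
    ("core", "default_timezone"): "system",
}

env = {airflow_env_var(section, key): value for (section, key), value in settings.items()}

for name, value in env.items():
    print(f"{name}={value}")
```

The resulting name=value pairs can then be supplied through whatever mechanism the deployment uses for environment variables.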