
Scheduler Loading DAGs That Have Not Changed

See original GitHub issue

Apache Airflow version

2.2.3 (latest released)

What happened

I have a few DAGs in my DAG folder. I used git-sync to copy them into the DAG folder. I can see the DAGs inside the folder, and the last time they were changed was Jan 24 (I checked with the ls -l /opt/airflow/dags/repo/ command). Example for one DAG in my DAG folder:

-rw-r--r-- 1 65533 root 4141 Jan 24 19:30 clear_missing_dags.py

When I opened the scheduler logs under /opt/airflow/logs/scheduler/latest/{my_dag_file}.log and the logs in /opt/airflow/logs/dag_processor_manager/dag_processor_manager.log, I saw that the scheduler loaded the DAGs on Jan 25 even though they had not changed. Example from the scheduler logs:

[2022-01-25 09:46:18,615] {processor.py:654} INFO - DAG(s) dict_keys(['clear_missing_dags']) retrieved from /opt/airflow/dags/repo/clear_missing_dags.py
[2022-01-25 09:46:18,633] {logging_mixin.py:109} INFO - [2022-01-25 09:46:18,633] {dag.py:2396} INFO - Sync 1 DAGs
[2022-01-25 09:46:18,655] {logging_mixin.py:109} INFO - [2022-01-25 09:46:18,655] {dag.py:2935} INFO - Setting next_dagrun for clear_missing_dags to None
[2022-01-25 09:46:18,676] {processor.py:171} INFO - Processing /opt/airflow/dags/repo/clear_missing_dags.py took 0.186 seconds

Example from the DAG processor manager logs:

DAG File Processing Stats

File Path                                     PID    Runtime      # DAGs    # Errors  Last Runtime    Last Run
--------------------------------------------  -----  ---------  --------  ----------  --------------  -------------------
/opt/airflow/dags/repo/bash_example.py                                 0           1  0.15s           2022-01-25T09:48:20
/opt/airflow/dags/repo/branch_datetime.py                              0           1  0.15s           2022-01-25T09:48:26
/opt/airflow/dags/repo/python_example.py                               1           0  0.20s           2022-01-25T09:48:33
/opt/airflow/dags/repo/clear_missing_dags.py                           1           0  0.17s           2022-01-25T09:48:20
================================================================================
[2022-01-25 09:48:48,730] {manager.py:1065} INFO - Finding 'running' jobs without a recent heartbeat
[2022-01-25 09:48:48,731] {manager.py:1069} INFO - Failing jobs without heartbeat after 2022-01-25 09:43:48.731074+00:00

As far as I know, the scheduler checks whether a DAG has changed (by comparing the file's modification date against the last time the DAG was loaded). It seems like this is not working.

What you expected to happen

I expected that the scheduler would not try to load the DAG again until we changed it.

How to reproduce

This happens on the default Helm chart deployment (I used helm install airflow .). You can reproduce it by deploying the chart and creating a DAG file inside the DAG folder.
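For reference, a DAG file as small as the sketch below is enough to observe the behaviour (the file name and dag_id are hypothetical, not from the original report): drop it into the synced DAG folder, leave it untouched, and the dag_processor_manager log still shows it being parsed on every cycle.

# minimal_example.py - hypothetical DAG file placed in the synced DAG folder
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="minimal_example",          # hypothetical dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,            # never scheduled; the file is still parsed
    catchup=False,
) as dag:
    BashOperator(task_id="noop", bash_command="echo parsed")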

Operating System

Debian GNU/Linux 10 (buster)

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

I used the default values from the Helm chart and only configured the git-sync option.

Anything else

This problem happens each time the DAGs are loaded. It causes the scheduler to run the cluster policies every X seconds instead of running them only when a DAG has changed.
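For context, a cluster policy is a hook such as dag_policy defined in airflow_local_settings.py, and because every parse re-runs it, something like the sketch below (the tag name is purely illustrative) executes on every parsing cycle rather than only when the DAG file changes.

# airflow_local_settings.py - sketch of a DAG-level cluster policy
from airflow.models.dag import DAG

def dag_policy(dag: DAG) -> None:
    # Called for every DAG each time its file is parsed by the DAG processor,
    # not only when the file's modification time changes.
    dag.tags = list(set(dag.tags or []) | {"checked-by-policy"})  # illustrative mutation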

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
potiuk commented, Feb 9, 2022

This is as intended. All DAGs are parsed continuously, whether they changed or not, simply because re-parsing a DAG file at a different time can produce a different DAG (for example, if the DAG reads an external file and builds its structure from it, it might produce a different DAG whenever the external file changes; the same goes for importing external libraries).

The time of last modification of the DAG only matters for scheduling priority, but each DAG will be re-parsed every min_file_process_interval seconds. This is how Airflow currently works.
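To illustrate the point (the file and the task_list.json path below are hypothetical, not taken from the comment), the DAG in this sketch produces a different set of tasks depending on the contents of an external file, so its structure can change even though the .py file's modification time does not:

# dynamic_tasks.py - hypothetical DAG whose structure depends on an external file
import json
from datetime import datetime
from pathlib import Path

from airflow import DAG
from airflow.operators.bash import BashOperator

TASKS_FILE = Path("/opt/airflow/dags/repo/task_list.json")  # hypothetical side file

with DAG(
    dag_id="dynamic_tasks",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    # The set of tasks depends on the side file, not on this .py file,
    # so only periodic re-parsing picks up changes to it.
    task_names = json.loads(TASKS_FILE.read_text()) if TASKS_FILE.exists() else ["default"]
    for name in task_names:
        BashOperator(task_id=name, bash_command=f"echo {name}")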

1 reaction
avkirilishin commented, Jan 29, 2022

You can set min_file_process_interval to a large value (for example 86400), file_parsing_sort_mode to “modified_time” and default_timezone to “system”.

I think we have to change the min_file_process_interval description due to https://github.com/apache/airflow/commit/add7490145fabd097d605d85a662dccd02b600de

Number of seconds after which a DAG file is parsed. The DAG file is parsed every min_file_process_interval number of seconds. Updates to DAGs are reflected after this interval or after the DAG file modification if file_parsing_sort_mode is set to “modified_time”. Keeping this number low will increase CPU usage.
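For anyone applying those settings, a quick sketch for verifying the effective values on a stock Airflow 2.2.x install (the values in the comments are the ones suggested above, not the defaults):

# check_parsing_config.py - prints the parsing-related settings currently in effect
from airflow.configuration import conf

print(conf.getint("scheduler", "min_file_process_interval"))  # e.g. 86400 if overridden
print(conf.get("scheduler", "file_parsing_sort_mode"))        # e.g. "modified_time"
print(conf.get("core", "default_timezone"))                   # e.g. "system"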
