Scheduler gets stuck without a trace
Apache Airflow version:

Kubernetes version (if you are using kubernetes) (use `kubectl version`):

Environment:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Install tools:
- Others:

What happened:
The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop. The only way I could make it run again is by manually restarting the scheduler service, but after running some tasks it gets stuck again. I've tried with both the Celery and Local executors, but the same issue occurs. I am using the `-n 3` parameter while starting the scheduler.
Scheduler configs:
- job_heartbeat_sec = 5
- scheduler_heartbeat_sec = 5
- executor = LocalExecutor
- parallelism = 32
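As a quick sanity check (a sketch, not part of the original report), the values Airflow actually resolved can be printed from a Python shell on the scheduler host using the standard `airflow.configuration.conf` object. The section names below assume the default layout, with `executor`/`parallelism` under `[core]` and the heartbeat settings under `[scheduler]`:

```python
from airflow.configuration import conf

# Sketch only: print the effective values of the settings listed above.
# Adjust the section names if your airflow.cfg places them elsewhere.
print("executor                =", conf.get("core", "executor"))
print("parallelism             =", conf.getint("core", "parallelism"))
print("job_heartbeat_sec       =", conf.getint("scheduler", "job_heartbeat_sec"))
print("scheduler_heartbeat_sec =", conf.getint("scheduler", "scheduler_heartbeat_sec"))
```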
Please help. I would be happy to provide any other information needed.
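One low-overhead way to see where a live-but-spinning scheduler is stuck is to make it dump its own Python tracebacks on demand. The snippet below is a minimal diagnostic sketch, assuming you can run a small wrapper inside the scheduler process; it only uses the stdlib `faulthandler` module and requires Unix signals. Once registered, `kill -USR1 <scheduler pid>` prints every thread's stack to stderr without stopping the process. py-spy, used in the comments below, obtains the same information from outside the process.

```python
import faulthandler
import signal
import sys

# Minimal diagnostic sketch: register SIGUSR1 inside the scheduler process
# (e.g. from a wrapper script that afterwards runs the normal Airflow CLI).
# Sending `kill -USR1 <pid>` then dumps the traceback of every thread to
# stderr while the process keeps running, showing where it is spinning.
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```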
What you expected to happen:
How to reproduce it:
Anything else we need to know:
Moved here from https://issues.apache.org/jira/browse/AIRFLOW-401
Top GitHub Comments
We just saw this on 2.0.1 when we added a largish number of new DAGs (we're adding around 6,000 DAGs total, but the scheduler seems to lock up when about 200 try to be scheduled at once).
Here are py-spy stack traces from our scheduler:
What I think is happening is that the pipe between the `DagFileProcessorAgent` and the `DagFileProcessorManager` is full and is causing the Scheduler to deadlock.

From what I can see, the `DagFileProcessorAgent` only pulls data off the pipe in its `heartbeat` and `wait_until_finished` functions (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/utils/dag_processing.py#L374), and the SchedulerJob is responsible for calling its `heartbeat` function each scheduler loop (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/jobs/scheduler_job.py#L1388). However, the SchedulerJob is blocked from calling `heartbeat` because it is stuck forever trying to send data to the same full pipe as part of `_send_dag_callbacks_to_processor` in the `_do_scheduling` function, causing a deadlock.
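For intuition, here is a minimal, self-contained sketch of the underlying mechanism (plain `multiprocessing`, not Airflow code): once the OS buffer behind a pipe fills up, the writer blocks inside `send()` until someone reads from the other end, which is the situation described above once the scheduler can no longer reach its `heartbeat` call.

```python
import multiprocessing as mp
import time


def flood(conn):
    """Keep sending while nobody drains the other end of the pipe.

    Once the OS pipe buffer is full, Connection.send() blocks forever --
    the analogue of the scheduler stuck in _send_dag_callbacks_to_processor
    while nothing calls heartbeat() to pull data off the pipe.
    """
    payload = b"x" * 65536  # 64 KiB chunks fill a typical pipe buffer quickly
    sent = 0
    while True:
        conn.send(payload)
        sent += 1
        print(f"sent message {sent}", flush=True)


if __name__ == "__main__":
    reader_end, writer_end = mp.Pipe()
    sender = mp.Process(target=flood, args=(writer_end,), daemon=True)
    sender.start()

    # This side plays the "busy" reader that never calls recv(): after a
    # handful of sends the child stops printing and sits blocked inside
    # send() indefinitely.
    time.sleep(5)
    print(f"sender still alive and blocked in send(): {sender.is_alive()}")
```

Restarting the scheduler empties the pipe, which may be why a restart appears to fix things only until the backlog builds up again.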
+1 on this issue.
Airflow 2.0.1, CeleryExecutor, ~7,000 DAGs. It seems to happen under load (when a bunch of DAGs all kick off at midnight).
`py-spy dump --pid 132 --locals`
`py-spy dump --pid 134 --locals`