Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

scheduler gets stuck without a trace

See original GitHub issue

Apache Airflow version:

Kubernetes version (if you are using kubernetes) (use kubectl version):

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

What happened:

The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service sits at 100%. No jobs get submitted and everything comes to a halt; it looks like it goes into some kind of infinite loop. The only way I could make it run again is by manually restarting the scheduler service, but after running some tasks it gets stuck again. I’ve tried with both the Celery and Local executors, but the same issue occurs. I am using the -n 3 parameter while starting the scheduler.

Scheduler configs: job_heartbeat_sec = 5, scheduler_heartbeat_sec = 5, executor = LocalExecutor, parallelism = 32
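For context, these settings would normally sit in airflow.cfg roughly as follows (a sketch only; section names follow the standard Airflow config layout, the values are the ones reported above, and everything else is deployment-specific):

```
[core]
executor = LocalExecutor
parallelism = 32

[scheduler]
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
```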

Please help. I would be happy to provide any other information needed.

What you expected to happen:

How to reproduce it:

Anything else we need to know:

Moved here from https://issues.apache.org/jira/browse/AIRFLOW-401
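Since the only remedy reported above is restarting the scheduler by hand, here is a hedged sketch of automating that restart as a stop-gap: a small watchdog that bounces the scheduler whenever the newest SchedulerJob heartbeat in the metadata database goes stale, on the assumption that a wedged scheduler also stops heartbeating (which is what the blocked main loops in the traces further down imply). The connection string, the airflow-scheduler unit name, and the 5-minute threshold are placeholders to adjust for your deployment; this is not part of Airflow.

```python
# Hedged sketch, not part of Airflow: restart the scheduler when the newest
# SchedulerJob heartbeat in the metadata DB goes stale. Assumes the standard
# `job` table (job_type, state, latest_heartbeat), timezone-aware UTC
# timestamps from the driver (e.g. Postgres; normalize first for MySQL), and
# a systemd unit named `airflow-scheduler` -- all deployment-specific choices.
import subprocess
import time
from datetime import datetime, timedelta, timezone

from sqlalchemy import create_engine, text

SQL_ALCHEMY_CONN = "postgresql+psycopg2://airflow:***@localhost/airflow"  # hypothetical
STALE_AFTER = timedelta(minutes=5)

engine = create_engine(SQL_ALCHEMY_CONN)

def latest_scheduler_heartbeat():
    """Newest heartbeat of any running SchedulerJob, or None if there is none."""
    with engine.connect() as conn:
        row = conn.execute(text(
            "SELECT MAX(latest_heartbeat) FROM job "
            "WHERE job_type = 'SchedulerJob' AND state = 'running'"
        )).fetchone()
    return row[0]

while True:
    heartbeat = latest_scheduler_heartbeat()
    if heartbeat is None or datetime.now(timezone.utc) - heartbeat > STALE_AFTER:
        # No live scheduler job or a stale heartbeat: assume it is wedged
        # (e.g. in the blocked pipe send shown below) and bounce the service.
        subprocess.run(["systemctl", "restart", "airflow-scheduler"], check=False)
        time.sleep(120)  # give the restarted scheduler time to heartbeat again
    time.sleep(30)
```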

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 7
  • Comments: 71 (42 by maintainers)

Top GitHub Comments

12 reactions
MatthewRBruce commented, Feb 18, 2021

We just saw this on 2.0.1 when we added a largish number of new DAGs (we’re adding around 6,000 DAGs total, but it seems to lock up when about 200 try to be scheduled at once).

Here are py-spy stack traces from our scheduler:

Process 6: /usr/local/bin/python /usr/local/bin/airflow scheduler
Python v3.8.7 (/usr/local/bin/python3.8)
Thread 0x7FF5C09C8740 (active): "MainThread"
    _send (multiprocessing/connection.py:368)
    _send_bytes (multiprocessing/connection.py:411)
    send (multiprocessing/connection.py:206)
    send_callback_to_execute (airflow/utils/dag_processing.py:283)
    _send_dag_callbacks_to_processor (airflow/jobs/scheduler_job.py:1795)
    _schedule_dag_run (airflow/jobs/scheduler_job.py:1762)
    _do_scheduling (airflow/jobs/scheduler_job.py:1521)
    _run_scheduler_loop (airflow/jobs/scheduler_job.py:1382)
    _execute (airflow/jobs/scheduler_job.py:1280)
    run (airflow/jobs/base_job.py:237)
    scheduler (airflow/cli/commands/scheduler_command.py:63)
    wrapper (airflow/utils/cli.py:89)
    command (airflow/cli/cli_parser.py:48)
    main (airflow/__main__.py:40)
    <module> (airflow:8)
 
Process 77: airflow scheduler -- DagFileProcessorManager
Python v3.8.7 (/usr/local/bin/python3.8)
Thread 0x7FF5C09C8740 (active): "MainThread"
    _send (multiprocessing/connection.py:368)
    _send_bytes (multiprocessing/connection.py:405)
    send (multiprocessing/connection.py:206)
    _run_parsing_loop (airflow/utils/dag_processing.py:698)
    start (airflow/utils/dag_processing.py:596)
    _run_processor_manager (airflow/utils/dag_processing.py:365)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:315)
    _launch (multiprocessing/popen_fork.py:75)
    __init__ (multiprocessing/popen_fork.py:19)
    _Popen (multiprocessing/context.py:277)
    start (multiprocessing/process.py:121)
    start (airflow/utils/dag_processing.py:248)
    _execute (airflow/jobs/scheduler_job.py:1276)
    run (airflow/jobs/base_job.py:237)
    scheduler (airflow/cli/commands/scheduler_command.py:63)
    wrapper (airflow/utils/cli.py:89)
    command (airflow/cli/cli_parser.py:48)
    main (airflow/__main__.py:40)
    <module> (airflow:8)

What I think is happening is that the pipe between the DagFileProcessorAgent and the DagFileProcessorManager is full, which is causing the Scheduler to deadlock.

From what I can see, the DagFileProcessorAgent only pulls data off the pipe in its heartbeat and wait_until_finished functions (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/utils/dag_processing.py#L374),

and the SchedulerJob is responsible for calling its heartbeat function each scheduler loop (https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/jobs/scheduler_job.py#L1388).

However, the SchedulerJob never gets to call heartbeat, because it is blocked forever trying to send data to that same full pipe as part of _send_dag_callbacks_to_processor in the _do_scheduling function, causing a deadlock.
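For readers less familiar with multiprocessing pipes, here is a minimal, self-contained sketch (not Airflow code; the writer role and buffer size are illustrative only) of the blocking behaviour described above: Connection.send() blocks for good once the OS pipe buffer fills and nothing ever drains the other end.

```python
# Minimal sketch, not Airflow code: a multiprocessing Connection.send() blocks
# once the OS pipe buffer is full and the peer never reads. In the traces above
# both ends of Airflow's duplex pipe are writers (callbacks one way, parsing
# results the other), so both processes can end up stuck in send() like this.
import multiprocessing as mp

def writer(conn):
    chunk = b"x" * 65536
    while True:
        conn.send_bytes(chunk)  # blocks forever once the pipe buffer is full

if __name__ == "__main__":
    recv_end, send_end = mp.Pipe(duplex=False)
    p = mp.Process(target=writer, args=(send_end,), daemon=True)
    p.start()

    p.join(timeout=5)  # the writer is still alive, wedged inside send()
    print("writer alive and blocked in send():", p.is_alive())
```

Reading from recv_end would unblock the writer; in Airflow that drain only happens in the agent’s heartbeat and wait_until_finished calls, which a scheduler stuck in send() never reaches, hence the deadlock.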

6 reactions
leonsmith commented, Mar 23, 2021

+1 on this issue.

Airflow 2.0.1

CeleryExecutor.

~7,000 DAGs. It seems to happen under load (when a bunch of DAGs all kick off at midnight).

py-spy dump --pid 132 --locals
Process 132: /usr/local/bin/python /usr/local/bin/airflow scheduler
Python v3.8.3 (/usr/local/bin/python)
Thread 132 (idle): "MainThread"
  _send (multiprocessing/connection.py:368)
      Arguments::
          self: <Connection at 0x7f5db7aac550>
          buf: <bytes at 0x5564f22e5260>
          write: <builtin_function_or_method at 0x7f5dbed8a540>
      Locals::
          remaining: 1213
  _send_bytes (multiprocessing/connection.py:411)
      Arguments::
          self: <Connection at 0x7f5db7aac550>
          buf: <memoryview at 0x7f5db66f4a00>
      Locals::
          n: 1209
          header: <bytes at 0x7f5dbc01fb10>
  send (multiprocessing/connection.py:206)
      Arguments::
          self: <Connection at 0x7f5db7aac550>
          obj: <TaskCallbackRequest at 0x7f5db7398940>
  send_callback_to_execute (airflow/utils/dag_processing.py:283)
      Arguments::
          self: <DagFileProcessorAgent at 0x7f5db7aac880>
          request: <TaskCallbackRequest at 0x7f5db7398940>
  _process_executor_events (airflow/jobs/scheduler_job.py:1242)
      Arguments::
          self: <SchedulerJob at 0x7f5dbed3dd00>
          session: <Session at 0x7f5db80cf6a0>
      Locals::
          ti_primary_key_to_try_number_map: {("redacted", "redacted", <datetime.datetime at 0x7f5db768b540>): 1, ...}
          event_buffer: {...}
          tis_with_right_state: [("redacted", "redacted", <datetime.datetime at 0x7f5db768b540>, 1), ...]
          ti_key: ("redacted", "redacted", ...)
          value: ("failed", None)
          state: "failed"
          _: None
          filter_for_tis: <BooleanClauseList at 0x7f5db7427df0>
          tis: [<TaskInstance at 0x7f5dbbfd77c0>, <TaskInstance at 0x7f5dbbfd7880>, <TaskInstance at 0x7f5dbbfdd820>, ...]
          ti: <TaskInstance at 0x7f5dbbffba90>
          try_number: 1
          buffer_key: ("redacted", ...)
          info: None
          msg: "Executor reports task instance %s finished (%s) although the task says its %s. (Info: %s) Was the task killed externally?"
          request: <TaskCallbackRequest at 0x7f5db7398940>
  wrapper (airflow/utils/session.py:62)
      Locals::
          args: (<SchedulerJob at 0x7f5dbed3dd00>)
          kwargs: {"session": <Session at 0x7f5db80cf6a0>}
  _run_scheduler_loop (airflow/jobs/scheduler_job.py:1386)
      Arguments::
          self: <SchedulerJob at 0x7f5dbed3dd00>
      Locals::
          is_unit_test: False
          call_regular_interval: <function at 0x7f5db7ac3040>
          loop_count: 1
          timer: <Timer at 0x7f5db76808b0>
          session: <Session at 0x7f5db80cf6a0>
          num_queued_tis: 17
  _execute (airflow/jobs/scheduler_job.py:1280)
      Arguments::
          self: <SchedulerJob at 0x7f5dbed3dd00>
      Locals::
          pickle_dags: False
          async_mode: True
          processor_timeout_seconds: 600
          processor_timeout: <datetime.timedelta at 0x7f5db7ab9300>
          execute_start_time: <datetime.datetime at 0x7f5db7727510>
  run (airflow/jobs/base_job.py:237)
      Arguments::
          self: <SchedulerJob at 0x7f5dbed3dd00>
      Locals::
          session: <Session at 0x7f5db80cf6a0>
  scheduler (airflow/cli/commands/scheduler_command.py:63)
      Arguments::
          args: <Namespace at 0x7f5db816f6a0>
      Locals::
          job: <SchedulerJob at 0x7f5dbed3dd00>
  wrapper (airflow/utils/cli.py:89)
      Locals::
          args: (<Namespace at 0x7f5db816f6a0>)
          kwargs: {}
          metrics: {"sub_command": "scheduler", "start_datetime": <datetime.datetime at 0x7f5db80f5db0>, ...}
  command (airflow/cli/cli_parser.py:48)
      Locals::
          args: (<Namespace at 0x7f5db816f6a0>)
          kwargs: {}
          func: <function at 0x7f5db8090790>
  main (airflow/__main__.py:40)
      Locals::
          parser: <DefaultHelpParser at 0x7f5dbec13700>
          args: <Namespace at 0x7f5db816f6a0>
  <module> (airflow:8)
py-spy dump --pid 134 --locals
Process 134: airflow scheduler -- DagFileProcessorManager
Python v3.8.3 (/usr/local/bin/python)
Thread 134 (idle): "MainThread"
  _send (multiprocessing/connection.py:368)
      Arguments::
          self: <Connection at 0x7f5db77274f0>
          buf: <bytes at 0x5564f1a76590>
          write: <builtin_function_or_method at 0x7f5dbed8a540>
      Locals::
          remaining: 2276
  _send_bytes (multiprocessing/connection.py:411)
      Arguments::
          self: <Connection at 0x7f5db77274f0>
          buf: <memoryview at 0x7f5db77d7c40>
      Locals::
          n: 2272
          header: <bytes at 0x7f5db6eb1f60>
  send (multiprocessing/connection.py:206)
      Arguments::
          self: <Connection at 0x7f5db77274f0>
          obj: (...)
  _run_parsing_loop (airflow/utils/dag_processing.py:698)
      Locals::
          poll_time: 0.9996239839999816
          loop_start_time: 690.422146969
          ready: [<Connection at 0x7f5db77274f0>]
          agent_signal: <TaskCallbackRequest at 0x7f5db678c8e0>
          sentinel: <Connection at 0x7f5db77274f0>
          processor: <DagFileProcessorProcess at 0x7f5db6eb1910>
          all_files_processed: False
          max_runs_reached: False
          dag_parsing_stat: (...)
          loop_duration: 0.0003760160000183532
  start (airflow/utils/dag_processing.py:596)
      Arguments::
          self: <DagFileProcessorManager at 0x7f5dbcb9c880>
  _run_processor_manager (airflow/utils/dag_processing.py:365)
      Arguments::
          dag_directory: "/code/src/dags"
          max_runs: -1
          processor_factory: <function at 0x7f5db7b30ee0>
          processor_timeout: <datetime.timedelta at 0x7f5db7ab9300>
          signal_conn: <Connection at 0x7f5db77274f0>
          dag_ids: []
          pickle_dags: False
          async_mode: True
      Locals::
          processor_manager: <DagFileProcessorManager at 0x7f5dbcb9c880>
  run (multiprocessing/process.py:108)
      Arguments::
          self: <ForkProcess at 0x7f5db7727220>
  _bootstrap (multiprocessing/process.py:315)
      Arguments::
          self: <ForkProcess at 0x7f5db7727220>
          parent_sentinel: 8
      Locals::
          util: <module at 0x7f5db8011e00>
          context: <module at 0x7f5dbcb8ba90>
  _launch (multiprocessing/popen_fork.py:75)
      Arguments::
          self: <Popen at 0x7f5db7727820>
          process_obj: <ForkProcess at 0x7f5db7727220>
      Locals::
          code: 1
          parent_r: 6
          child_w: 7
          child_r: 8
          parent_w: 9
  __init__ (multiprocessing/popen_fork.py:19)
      Arguments::
          self: <Popen at 0x7f5db7727820>
          process_obj: <ForkProcess at 0x7f5db7727220>
  _Popen (multiprocessing/context.py:276)
      Arguments::
          process_obj: <ForkProcess at 0x7f5db7727220>
      Locals::
          Popen: <type at 0x5564f1a439e0>
  start (multiprocessing/process.py:121)
      Arguments::
          self: <ForkProcess at 0x7f5db7727220>
  start (airflow/utils/dag_processing.py:248)
      Arguments::
          self: <DagFileProcessorAgent at 0x7f5db7aac880>
      Locals::
          mp_start_method: "fork"
          context: <ForkContext at 0x7f5dbcb9ce80>
          child_signal_conn: <Connection at 0x7f5db77274f0>
          process: <ForkProcess at 0x7f5db7727220>
  _execute (airflow/jobs/scheduler_job.py:1276)
      Arguments::
          self: <SchedulerJob at 0x7f5dbed3dd00>
      Locals::
          pickle_dags: False
          async_mode: True
          processor_timeout_seconds: 600
          processor_timeout: <datetime.timedelta at 0x7f5db7ab9300>
  run (airflow/jobs/base_job.py:237)
      Arguments::
          self: <SchedulerJob at 0x7f5dbed3dd00>
      Locals::
          session: <Session at 0x7f5db80cf6a0>
  scheduler (airflow/cli/commands/scheduler_command.py:63)
      Arguments::
          args: <Namespace at 0x7f5db816f6a0>
      Locals::
          job: <SchedulerJob at 0x7f5dbed3dd00>
  wrapper (airflow/utils/cli.py:89)
      Locals::
          args: (<Namespace at 0x7f5db816f6a0>)
          kwargs: {}
          metrics: {"sub_command": "scheduler", "start_datetime": <datetime.datetime at 0x7f5db80f5db0>, ...}
  command (airflow/cli/cli_parser.py:48)
      Locals::
          args: (<Namespace at 0x7f5db816f6a0>)
          kwargs: {}
          func: <function at 0x7f5db8090790>
  main (airflow/__main__.py:40)
      Locals::
          parser: <DefaultHelpParser at 0x7f5dbec13700>
          args: <Namespace at 0x7f5db816f6a0>
  <module> (airflow:8)
Read more comments on GitHub.

Top Results From Across the Web

Scheduler freezing/hanging without a trace - Airflow
Symptoms. Scheduler stops doing work as expected while continuing to heartbeat; No tasks are scheduled or executed, task instances are not ...

Spring Scheduler stops unexpectedly - java - Stack Overflow
Once you have a stack trace, find the scheduler thread and see what it is doing. Is it possible that the task it...

Scheduled jobs may stall and fail to process if one job ...
The Scheduled Jobs screen will not accurate reflect the "stuck" status of the job (due to the way jobs and their run times...

Troubleshooting Airflow scheduler issues | Cloud Composer
To check if you have tasks stuck in a queue, follow these steps. ... Low performance of the Airflow database might be the...

How to fix a Job schedular which has been stuck for a while
Hi, I have a simple job scheduler in Test environment, which is being used ... Run is being set to some future date and...
