Airflow Scheduler liveness probe crashing (version 2.0)

See original GitHub issue

Apache Airflow version: 2.0

Kubernetes version: 1.18.14

Environment: Azure - AKS

What happened:

I have just upgraded my Airflow from 1.10.13 to 2.0. I am running it in Kubernetes (AKS Azure) with the Kubernetes Executor. Unfortunately, I see my scheduler getting killed every 15-20 minutes because the liveness probe fails, so the pod keeps restarting.

Liveness probe

import os

# Silence everything below ERROR so the probe's output stays clean
os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'

from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname
import sys

# Fetch the most recent SchedulerJob record for this host ...
with create_session() as session:
    job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
        SchedulerJob.latest_heartbeat.desc()).limit(1).first()

# ... and report healthy (exit 0) only if its latest heartbeat is still considered alive
sys.exit(0 if job.is_alive() else 1)
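
If you want to see why the probe fails rather than just its exit code, a slightly more verbose variant of the same check can be run by hand inside the scheduler container (for example via kubectl exec). This is only a diagnostic sketch built from the script above; the heartbeat-age printout and the missing-job guard are troubleshooting additions, not part of the shipped probe:

import os
os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'

import sys

from airflow.jobs.scheduler_job import SchedulerJob
from airflow.utils import timezone
from airflow.utils.db import create_session
from airflow.utils.net import get_hostname

# Same query as the probe: newest SchedulerJob row for this hostname
with create_session() as session:
    job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
        SchedulerJob.latest_heartbeat.desc()).limit(1).first()

if job is None:
    # No SchedulerJob row matches this hostname, e.g. after a hostname mismatch
    print(f"no SchedulerJob found for hostname {get_hostname()}")
    sys.exit(1)

# How stale is the heartbeat the probe is judging, and what does is_alive() say?
age = (timezone.utcnow() - job.latest_heartbeat).total_seconds()
print(f"latest heartbeat {age:.0f}s ago, is_alive={job.is_alive()}")
sys.exit(0 if job.is_alive() else 1)

If this script itself is slow or errors out (for example while waiting on the metadata database), that is worth knowing too, since the real probe pays the same cost on every check.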

Scheduler logs

[2021-02-16 12:18:22,422] {scheduler_job.py:933} DEBUG - No tasks to consider for execution.
[2021-02-16 12:18:22,426] {base_executor.py:147} DEBUG - 0 running task instances
[2021-02-16 12:18:22,426] {base_executor.py:148} DEBUG - 0 in queue
[2021-02-16 12:18:22,426] {base_executor.py:149} DEBUG - 32 open slots
[2021-02-16 12:18:22,427] {base_executor.py:158} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2021-02-16 12:18:22,427] {kubernetes_executor.py:337} DEBUG - Syncing KubernetesExecutor
[2021-02-16 12:18:22,427] {kubernetes_executor.py:263} DEBUG - KubeJobWatcher alive, continuing
[2021-02-16 12:18:22,439] {scheduler_job.py:1751} INFO - Resetting orphaned tasks for active dag runs
[2021-02-16 12:18:22,452] {settings.py:290} DEBUG - Disposing DB connection pool (PID 12819)
[2021-02-16 12:18:22,460] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor490-Process' pid=12819 parent=9286 stopped exitcode=0>
[2021-02-16 12:18:23,009] {settings.py:290} DEBUG - Disposing DB connection pool (PID 12826)
[2021-02-16 12:18:23,017] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor491-Process' pid=12826 parent=9286 stopped exitcode=0>
[2021-02-16 12:18:23,594] {settings.py:290} DEBUG - Disposing DB connection pool (PID 12833)

... Many of these Disposing DB connection pool entries here

[2021-02-16 12:20:08,212] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor675-Process' pid=14146 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:08,916] {settings.py:290} DEBUG - Disposing DB connection pool (PID 14153)
[2021-02-16 12:20:08,924] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor676-Process' pid=14153 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:09,475] {settings.py:290} DEBUG - Disposing DB connection pool (PID 14160)
[2021-02-16 12:20:09,484] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor677-Process' pid=14160 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:10,044] {settings.py:290} DEBUG - Disposing DB connection pool (PID 14167)
[2021-02-16 12:20:10,053] {scheduler_job.py:309} DEBUG - Waiting for <ForkProcess name='DagFileProcessor678-Process' pid=14167 parent=9286 stopped exitcode=0>
[2021-02-16 12:20:10,610] {settings.py:290} DEBUG - Disposing DB connection pool (PID 14180)
[2021-02-16 12:23:42,287] {scheduler_job.py:746} INFO - Exiting gracefully upon receiving signal 15
[2021-02-16 12:23:43,290] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 9286
[2021-02-16 12:23:43,494] {process_utils.py:201} INFO - Waiting up to 5 seconds for processes to exit...
[2021-02-16 12:23:43,503] {process_utils.py:61} INFO - Process psutil.Process(pid=14180, status='terminated', started='12:20:09') (14180) terminated with exit code None
[2021-02-16 12:23:43,503] {process_utils.py:61} INFO - Process psutil.Process(pid=9286, status='terminated', exitcode=0, started='12:13:35') (9286) terminated with exit code 0
[2021-02-16 12:23:43,506] {process_utils.py:95} INFO - Sending Signals.SIGTERM to GPID 9286
[2021-02-16 12:23:43,506] {scheduler_job.py:1296} INFO - Exited execute loop
[2021-02-16 12:23:43,523] {cli_action_loggers.py:84} DEBUG - Calling callbacks: []
[2021-02-16 12:23:43,525] {settings.py:290} DEBUG - Disposing DB connection pool (PID 7)

Scheduler deployment

---
################################
## Airflow Scheduler Deployment/StatefulSet
#################################
kind: Deployment
apiVersion: apps/v1
metadata:
  name: airflow-scheduler
  namespace: airflow
  labels:
    tier: airflow
    component: scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      tier: airflow
      component: scheduler
  template:
    metadata:
      labels:
        tier: airflow
        component: scheduler
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      nodeSelector:
        {}
      affinity:
        {}
      tolerations:
        []
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      serviceAccountName: airflow-scheduler
      securityContext:
        runAsUser: 50000
        fsGroup: 50000
      initContainers:
        - name: run-airflow-migrations
          image: apache/airflow:2.0.0-python3.8
          imagePullPolicy: IfNotPresent
          # Support running against 1.10.x and 2.0.0dev/master
          args: ["bash", "-c", "airflow db upgrade"]
          env:          
            # Dynamically created environment variables
            # Dynamically created secret envs
                      
            # Hard Coded Airflow Envs
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: fernet-key
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-airflow-metadata
                  key: connection
      containers:
        # Always run the main scheduler container.
        - name: scheduler
          image: apache/airflow:2.0.0-python3.8
          imagePullPolicy: Always
          args: ["bash", "-c", "exec airflow scheduler"]
          env:          
            # Dynamically created environment variables
            # Dynamically created secret envs
                      
            # Hard Coded Airflow Envs
            - name: AIRFLOW__CORE__FERNET_KEY
              valueFrom:
                secretKeyRef:
                  name: fernet-key
                  key: fernet-key
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  name: airflow-airflow-metadata
                  key: connection
            - name: AIRFLOW_CONN_AIRFLOW_DB
              valueFrom:
                secretKeyRef:
                  name: airflow-airflow-metadata
                  key: connection
            - name: DEPENDENCIES
              value: "/opt/airflow/dags/repo/dags/dependencies/"
          # If the scheduler stops heartbeating for 5 minutes (10*30s) kill the
          # scheduler and let Kubernetes restart it
          livenessProbe:
            failureThreshold: 10
            periodSeconds: 30
            exec:
              command:
                - python
                - -Wignore
                - -c
                - |
                  import os
                  os.environ['AIRFLOW__CORE__LOGGING_LEVEL'] = 'ERROR'
                  os.environ['AIRFLOW__LOGGING__LOGGING_LEVEL'] = 'ERROR'

                  from airflow.jobs.scheduler_job import SchedulerJob
                  from airflow.utils.db import create_session
                  from airflow.utils.net import get_hostname
                  import sys

                  with create_session() as session:
                      job = session.query(SchedulerJob).filter_by(hostname=get_hostname()).order_by(
                          SchedulerJob.latest_heartbeat.desc()).limit(1).first()

                  sys.exit(0 if job.is_alive() else 1)
          resources:
            {}
          volumeMounts:
            - name: config
              mountPath: /opt/airflow/pod_templates/pod_template_file.yaml
              subPath: pod_template_file.yaml
              readOnly: true
            - name: logs
              mountPath: "/opt/airflow/logs"
            - name: config
              mountPath: "/opt/airflow/airflow.cfg"
              subPath: airflow.cfg
              readOnly: true
            - name: dags
              mountPath: /opt/airflow/dags
            - name: logs-conf
              mountPath: "/opt/airflow/config/log_config.py"
              subPath: log_config.py
              readOnly: true
            - name: logs-conf-ini
              mountPath: "/opt/airflow/config/__init__.py"
              subPath: __init__.py
              readOnly: true
        - name: git-sync
          image: "k8s.gcr.io/git-sync:v3.1.6"
          securityContext:
            runAsUser: 65533
          env:
            - name: GIT_SYNC_REV
              value: "HEAD"
            - name: GIT_SYNC_BRANCH
              value: "master"
            - name: GIT_SYNC_REPO
              value:  HIDDEN
            - name: GIT_SYNC_DEPTH
              value: "1"
            - name: GIT_SYNC_ROOT
              value: "/git"
            - name: GIT_SYNC_DEST
              value: "repo"
            - name: GIT_SYNC_ADD_USER
              value: "true"
            - name: GIT_SYNC_WAIT
              value: "60"
            - name: GIT_SYNC_MAX_SYNC_FAILURES
              value: "0"
            - name: GIT_SYNC_USERNAME
              valueFrom:
                secretKeyRef:
                  name: 'codecommit-key'
                  key: username
            - name: GIT_SYNC_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: 'codecommit-key'
                  key: password
          volumeMounts:
          - name: dags
            mountPath: /git
        # Always start the garbage collector sidecar.
        - name: scheduler-gc
          image: apache/airflow:2.0.0-python3.8
          imagePullPolicy: Always
          args: ["bash", "/clean-logs"]
          volumeMounts:
            - name: logs
              mountPath: "/opt/airflow/logs"
            - name: logs-conf
              mountPath: "/opt/airflow/config/log_config.py"
              subPath: log_config.py
              readOnly: true
            - name: logs-conf-ini
              mountPath: "/opt/airflow/config/__init__.py"
              subPath: __init__.py
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: airflow-airflow-config
        - name: dags
          emptyDir: {}
        - name: logs
          emptyDir: {}
        - name: logs-conf
          configMap:
            name: airflow-airflow-config
        - name: logs-conf-ini
          configMap:
            name: airflow-airflow-config

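One thing worth double-checking in a manifest like the one above: the livenessProbe sets failureThreshold and periodSeconds but no timeoutSeconds, so Kubernetes applies its default of 1 second per attempt. An exec probe that starts a Python interpreter, imports Airflow and queries the metadata database can plausibly exceed that on a loaded node, which counts as a probe failure even while the scheduler is heartbeating normally. A hedged sketch of a more forgiving probe block (the timing values are illustrative, not taken from the original issue):

          livenessProbe:
            initialDelaySeconds: 60   # let the scheduler start heartbeating first
            timeoutSeconds: 20        # give the Python probe time to import Airflow and query the DB
            failureThreshold: 10
            periodSeconds: 30
            exec:
              command:
                - python
                - -Wignore
                - -c
                - |
                  # unchanged probe script from the deployment above
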

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 3
  • Comments: 23 (7 by maintainers)

Top GitHub Comments

stoiandl commented, Feb 24, 2021 (4 reactions)

I managed to fix my restarts by setting the following config options:

[kubernetes]
...
delete_option_kwargs = {"grace_period_seconds": 10}
enable_tcp_keepalive = True
tcp_keep_idle = 30
tcp_keep_intvl = 30
tcp_keep_cnt = 30

I have another Airflow instance running in Kubernetes on AWS; that one runs fine with any version. I realized the problem is specific to Azure Kubernetes, in the REST API calls to the Kubernetes API server.
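
For a manifest-managed scheduler like the deployment above, the same options can also be passed through Airflow's AIRFLOW__{SECTION}__{KEY} environment-variable convention instead of editing airflow.cfg. A sketch of the equivalent entries for the scheduler container's env list (values copied from the config snippet above; verify the option names against your Airflow version):

            - name: AIRFLOW__KUBERNETES__ENABLE_TCP_KEEPALIVE
              value: "True"
            - name: AIRFLOW__KUBERNETES__TCP_KEEP_IDLE
              value: "30"
            - name: AIRFLOW__KUBERNETES__TCP_KEEP_INTVL
              value: "30"
            - name: AIRFLOW__KUBERNETES__TCP_KEEP_CNT
              value: "30"
            - name: AIRFLOW__KUBERNETES__DELETE_OPTION_KWARGS
              value: '{"grace_period_seconds": 10}'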

careduz commented, Feb 23, 2021 (3 reactions)

We are facing the same issue (scheduler liveness probe always failing and restarting the scheduler). Details:

  • Airflow: 1.10.14 & 1.10.13
  • Kubernetes: 1.20.2 (DigitalOcean)
  • Helm chart airflow-stable/airflow: 7.16.0

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  27m                default-scheduler  Successfully assigned airflow/airflow-scheduler-75c6c96d68-r9j4m to apollo-kaon3thg1-882c2
  Normal   Pulled     27m                kubelet            Container image "alpine/git:latest" already present on machine
  Normal   Created    27m                kubelet            Created container git-clone
  Normal   Started    27m                kubelet            Started container git-clone
  Normal   Pulled     26m                kubelet            Container image "alpine/git:latest" already present on machine
  Normal   Created    26m                kubelet            Created container git-sync
  Normal   Started    26m                kubelet            Started container git-sync
  Normal   Killing    12m (x2 over 19m)  kubelet            Container airflow-scheduler failed liveness probe, will be restarted
  Normal   Pulled     11m (x3 over 26m)  kubelet            Container image "apache/airflow:1.10.14-python3.7" already present on machine
  Normal   Created    11m (x3 over 26m)  kubelet            Created container airflow-scheduler
  Normal   Started    11m (x3 over 26m)  kubelet            Started container airflow-scheduler
  Warning  Unhealthy  6m (x12 over 21m)  kubelet            Liveness probe failed:

And the logs are basically stuck in a loop:

1] {scheduler_job.py:280} DEBUG - Waiting for <ForkProcess(DagFileProcessor409-Process, stopped)>
[2021-02-23 22:58:35,578] {scheduler_job.py:1435} DEBUG - Starting Loop...
[2021-02-23 22:58:35,578] {scheduler_job.py:1446} DEBUG - Harvesting DAG parsing results
[2021-02-23 22:58:35,579] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:35,579] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:35,580] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:35,580] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:35,580] {scheduler_job.py:1448} DEBUG - Harvested 0 SimpleDAGs
[2021-02-23 22:58:35,581] {scheduler_job.py:1514} DEBUG - Heartbeating the executor
[2021-02-23 22:58:35,581] {base_executor.py:122} DEBUG - 0 running task instances
[2021-02-23 22:58:35,582] {base_executor.py:123} DEBUG - 0 in queue
[2021-02-23 22:58:35,582] {base_executor.py:124} DEBUG - 32 open slots
[2021-02-23 22:58:35,582] {base_executor.py:133} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2021-02-23 22:58:35,587] {scheduler_job.py:1469} DEBUG - Ran scheduling loop in 0.01 seconds
[2021-02-23 22:58:35,587] {scheduler_job.py:1472} DEBUG - Sleeping for 1.00 seconds
[2021-02-23 22:58:36,589] {scheduler_job.py:1484} DEBUG - Sleeping for 0.99 seconds to prevent excessive logging
[2021-02-23 22:58:36,729] {settings.py:310} DEBUG - Disposing DB connection pool (PID 6719)
[2021-02-23 22:58:36,930] {settings.py:310} DEBUG - Disposing DB connection pool (PID 6717)
[2021-02-23 22:58:37,258] {scheduler_job.py:280} DEBUG - Waiting for <ForkProcess(DagFileProcessor410-Process, stopped)>
[2021-02-23 22:58:37,259] {scheduler_job.py:280} DEBUG - Waiting for <ForkProcess(DagFileProcessor411-Process, stopped)>
[2021-02-23 22:58:37,582] {scheduler_job.py:1435} DEBUG - Starting Loop...
[2021-02-23 22:58:37,583] {scheduler_job.py:1446} DEBUG - Harvesting DAG parsing results
[2021-02-23 22:58:37,584] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:37,586] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:37,588] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:37,589] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:37,591] {scheduler_job.py:1448} DEBUG - Harvested 0 SimpleDAGs
[2021-02-23 22:58:37,592] {scheduler_job.py:1514} DEBUG - Heartbeating the executor
[2021-02-23 22:58:37,593] {base_executor.py:122} DEBUG - 0 running task instances
[2021-02-23 22:58:37,602] {base_executor.py:123} DEBUG - 0 in queue
[2021-02-23 22:58:37,604] {base_executor.py:124} DEBUG - 32 open slots
[2021-02-23 22:58:37,605] {base_executor.py:133} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2021-02-23 22:58:37,607] {scheduler_job.py:1460} DEBUG - Heartbeating the scheduler
[2021-02-23 22:58:37,620] {base_job.py:197} DEBUG - [heartbeat]
[2021-02-23 22:58:37,630] {scheduler_job.py:1469} DEBUG - Ran scheduling loop in 0.05 seconds
[2021-02-23 22:58:37,631] {scheduler_job.py:1472} DEBUG - Sleeping for 1.00 seconds
[2021-02-23 22:58:38,165] {settings.py:310} DEBUG - Disposing DB connection pool (PID 6769)
[2021-02-23 22:58:38,268] {settings.py:310} DEBUG - Disposing DB connection pool (PID 6765)
[2021-02-23 22:58:38,276] {scheduler_job.py:280} DEBUG - Waiting for <ForkProcess(DagFileProcessor412-Process, started)>
[2021-02-23 22:58:38,284] {scheduler_job.py:280} DEBUG - Waiting for <ForkProcess(DagFileProcessor413-Process, stopped)>
[2021-02-23 22:58:38,633] {scheduler_job.py:1484} DEBUG - Sleeping for 0.95 seconds to prevent excessive logging
[2021-02-23 22:58:39,331] {settings.py:310} DEBUG - Disposing DB connection pool (PID 6797)
[2021-02-23 22:58:39,361] {settings.py:310} DEBUG - Disposing DB connection pool (PID 6801)
[2021-02-23 22:58:39,589] {scheduler_job.py:1435} DEBUG - Starting Loop...
[2021-02-23 22:58:39,589] {scheduler_job.py:1446} DEBUG - Harvesting DAG parsing results
[2021-02-23 22:58:39,590] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:39,590] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:39,590] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:39,590] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:39,591] {scheduler_job.py:1448} DEBUG - Harvested 0 SimpleDAGs
[2021-02-23 22:58:39,591] {scheduler_job.py:1514} DEBUG - Heartbeating the executor
[2021-02-23 22:58:39,591] {base_executor.py:122} DEBUG - 0 running task instances
[2021-02-23 22:58:39,592] {base_executor.py:123} DEBUG - 0 in queue
[2021-02-23 22:58:39,593] {base_executor.py:124} DEBUG - 32 open slots
[2021-02-23 22:58:39,594] {base_executor.py:133} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2021-02-23 22:58:39,596] {scheduler_job.py:1469} DEBUG - Ran scheduling loop in 0.01 seconds
[2021-02-23 22:58:39,597] {scheduler_job.py:1472} DEBUG - Sleeping for 1.00 seconds
[2021-02-23 22:58:40,305] {scheduler_job.py:280} DEBUG - Waiting for <ForkProcess(DagFileProcessor414-Process, stopped)>
[2021-02-23 22:58:40,306] {scheduler_job.py:280} DEBUG - Waiting for <ForkProcess(DagFileProcessor415-Process, stopped)>
[2021-02-23 22:58:40,599] {scheduler_job.py:1484} DEBUG - Sleeping for 0.99 seconds to prevent excessive logging
[2021-02-23 22:58:41,349] {settings.py:310} DEBUG - Disposing DB connection pool (PID 6829)
[2021-02-23 22:58:41,386] {settings.py:310} DEBUG - Disposing DB connection pool (PID 6831)
[2021-02-23 22:58:41,595] {scheduler_job.py:1435} DEBUG - Starting Loop...
[2021-02-23 22:58:41,595] {scheduler_job.py:1446} DEBUG - Harvesting DAG parsing results
[2021-02-23 22:58:41,596] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:41,597] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:41,598] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:41,599] {dag_processing.py:658} DEBUG - Received message of type DagParsingStat
[2021-02-23 22:58:41,600] {scheduler_job.py:1448} DEBUG - Harvested 0 SimpleDAGs
[2021-02-23 22:58:41,601] {scheduler_job.py:1514} DEBUG - Heartbeating the executor
[2021-02-23 22:58:41,602] {base_executor.py:122} DEBUG - 0 running task instances
[2021-02-23 22:58:41,602] {base_executor.py:123} DEBUG - 0 in queue
[2021-02-23 22:58:41,604] {base_executor.py:124} DEBUG - 32 open slots
[2021-02-23 22:58:41,604] {base_executor.py:133} DEBUG - Calling the <class 'airflow.executors.kubernetes_executor.KubernetesExecutor'> sync method
[2021-02-23 22:58:41,607] {scheduler_job.py:1469} DEBUG - Ran scheduling loop in 0.01 seconds
[2021-02-23 22:58:41,608] {scheduler_job.py:1472} DEBUG - Sleeping for 1.00 seconds

EDIT: Tried it on Airflow 1.10.13 and same thing. Updated versions above.

Read more comments on GitHub.
