
Re-deploy scheduler tasks failing with SIGTERM on K8s executor

Apache Airflow version: 2.1.0

Kubernetes version (if you are using kubernetes) (use kubectl version): v1.18.17-gke.1901

Environment:

  • Cloud provider or hardware configuration: Google Cloud
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 10 (buster)
  • Kernel (e.g. uname -a): Linux airflow-scheduler-7697b66974-m6mct 5.4.89+ #1 SMP Sat Feb 13 19:45:14 PST 2021 x86_64 GNU/Linux
  • Install tools:
  • Others:

What happened: When the scheduler is restarted, the currently running tasks fail with a SIGTERM error. Every time the scheduler is restarted or re-deployed, the old scheduler is terminated and a new one is created. If tasks are running during this process, the new scheduler terminates them with a completed status, and new tasks are created to continue the work of the terminated ones. After a few seconds, these new tasks are terminated with an error status and a SIGTERM error.

Error log:

[2021-07-07 14:59:49,024] {cursor.py:661} INFO - query execution done
[2021-07-07 14:59:49,025] {arrow_result.pyx:0} INFO - fetching data done
[2021-07-07 15:00:07,361] {local_task_job.py:196} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2021-07-07 15:00:07,363] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 150
[2021-07-07 15:00:12,845] {taskinstance.py:1264} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-07-07 15:00:12,907] {process_utils.py:66} INFO - Process psutil.Process(pid=150, status='terminated', exitcode=0, started='14:59:46') (150) terminated with exit code 0

What you expected to happen:

The currently running tasks should be allowed to finish their work, or the substitute tasks should complete successfully. The new scheduler should not interfere with the running tasks.

How to reproduce it:

To reproduce, start a DAG that has one or more tasks that take a few minutes to complete. While these tasks are running, deploy a new scheduler. During the re-deploy, the current scheduler is terminated and a new one is created. The running tasks are marked as completed (without finishing their work) and are substituted by new ones that fail within seconds.
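
For illustration, a minimal DAG along the lines below is enough to open the reproduction window. The DAG id, task id, and sleep duration are placeholders, not names from the original report:

# Hypothetical reproduction DAG: a single task that sleeps for several minutes,
# leaving time to re-deploy the scheduler while the task is still running.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def long_running_task():
    # Simulate a query or job that needs a few minutes to finish.
    time.sleep(10 * 60)


with DAG(
    dag_id="sigterm_repro",           # placeholder name
    start_date=datetime(2021, 7, 1),
    schedule_interval=None,           # trigger manually
    catchup=False,
) as dag:
    PythonOperator(
        task_id="sleep_ten_minutes",
        python_callable=long_running_task,
    )

Trigger the DAG manually and, while the task is still sleeping, roll out a new scheduler deployment (for example with kubectl rollout restart deployment/airflow-scheduler, substituting your own deployment name).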

Anything else we need to know:

The problem was not happening with Airflow 1.10.15; it started happening after the upgrade to Airflow 2.1.0.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 11 (2 by maintainers)

Top GitHub Comments

2 reactions
dberzano commented, Sep 3, 2021

OK, let me retract - our problem was, it seems, different. The service account provided to the scheduler was not bound to any role containing the patch verb for the pods resource… so the issue is solved for us.

@rodrigo-morais maybe it was the same issue for you, not sure. Apologies in advance for the noise, but let me share in case it helps you - In our case, the scheduler’s error message contained the following failed attempt to adopt pods on the new scheduler’s startup:

[2021-09-03 13:31:57,269] {kubernetes_executor.py:663} INFO - attempting to adopt pod verylongpodname.37909dcabdcfe4598967b725b12ef92c
[2021-09-03 13:31:57,278] {kubernetes_executor.py:681} INFO - Failed to adopt pod verylongpodname.37909dcabdcfe4598967b725b12ef92c. Reason: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'aaaaaaaa-aaaa-aaaa-1111-bbbbbbbbbbbb', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 03 Sep 2021 13:31:57 GMT', 'Content-Length': '501'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"verylongpodname.37909dcabdcfe4598967b725b12ef92c\" is forbidden: User \"system:serviceaccount:ournamespace:ourserviceaccount\" cannot patch resource \"pods\" in API group \"\" in the namespace \"ournamespace\"","reason":"Forbidden","details":{"name":"verylongpodname.37909dcabdcfe4598967b725b12ef92c","kind":"pods"},"code":403}

We’ve edited the Role resource (check with kubectl get role) by adding the patch verb for the pods resource, which was missing, exactly as the error message said.
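
As a rough sketch (the Role name below is a placeholder and the verb list is just a typical set for pod launching, not taken from our cluster), the Role bound to the scheduler's service account needs a rule on pods that includes patch:

# Hypothetical Role for the scheduler's service account; the key point for this
# issue is that "patch" is included in the verbs for the "pods" resource.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-manager        # placeholder name
  namespace: ournamespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "delete", "patch"]

A quick way to confirm whether the verb is missing is kubectl auth can-i patch pods -n ournamespace --as=system:serviceaccount:ournamespace:ourserviceaccount.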

1 reaction
rodrigo-morais commented, Sep 3, 2021

Pods before deploying a new scheduler

NAME                                                             READY   STATUS    RESTARTS   AGE
airflow-scheduler-5c989d8478-kvllt                               1/1     Running   0          3m57s
airflow-web-74f56b9df-hvpzz                                      2/2     Running   0          3m57s
ascheckcountssfvspgbusinessdb.89bf993a26f3431fa07667de82954239   1/1     Running   0          40s
ascheckcountssfvspgcontentdb.8cf4f8e0ae794b3dbc9b471c9aef4532    1/1     Running   0          40s
dbt-docs-647db48b7d-vpj4p                                        1/1     Running   0          3m57s

Pods after deploying the new scheduler

NAME                                                             READY   STATUS              RESTARTS   AGE
airflow-scheduler-5c989d8478-kvllt                               1/1     Running             0          4m6s
airflow-scheduler-776f94956b-s4dfv                               0/1     ContainerCreating   0          5s
airflow-web-5c667d85db-kfcx6                                     0/2     ContainerCreating   0          2s
airflow-web-74f56b9df-hvpzz                                      2/2     Running             0          4m6s
ascheckcountssfvspgbusinessdb.89bf993a26f3431fa07667de82954239   1/1     Running             0          49s
ascheckcountssfvspgcontentdb.8cf4f8e0ae794b3dbc9b471c9aef4532    1/1     Running             0          49s
dbt-docs-647db48b7d-vpj4p                                        1/1     Running             0          4m6s

Pods after the old scheduler has been terminated and the new one has been created

NAME                                                                       READY   STATUS      RESTARTS   AGE
airflow-scheduler-776f94956b-s4dfv                                         1/1     Running     0          3m6s
airflow-web-5c667d85db-kfcx6                                               2/2     Running     0          3m3s
ascheckcountssfvspgcontentdb.44eee838b6754b0aa0c8e6288a6195a2              0/1     Error       0          2m16s
ascheckcountssfvspgcontentdb.8cf4f8e0ae794b3dbc9b471c9aef4532              0/1     Completed   0          3m50s
dbt-docs-647db48b7d-vpj4p                                                  1/1     Running     0          7m7s

Pod with error:
Running <TaskInstance: as_check_counts_sf_vs_pg.contentdb 2021-07-08T11:42:09.208095+00:00 [queued]> on host ascheckcountssfvspgcontentdb.44eee838b6754b0aa0c8e6288a6195a2
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 91, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 237, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 64, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 120, in _run_task_by_local_task_job
    run_job.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 237, in run
    self._execute()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 133, in _execute
    self.heartbeat()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 218, in heartbeat
    self.heartbeat_callback(session=session)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py", line 67, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 207, in heartbeat_callback
    ti._run_finished_callback(error=error)  # pylint: disable=protected-access
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1367, in _run_finished_callback
    task.on_failure_callback(context)

Pod with scheduler:
/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py:850: DeprecationWarning: Specifying both AIRFLOW_HOME environment variable and airflow_home in the config file is deprecated. Please use only the AIRFLOW_HOME environment variable and remove the config file entry.
  warnings.warn(msg, category=DeprecationWarning)
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
[2021-07-08 11:44:16,768] {dagrun.py:429} ERROR - Marking run <DagRun as_check_counts_sf_vs_pg @ 2021-07-08 11:42:09.208095+00:00: manual__2021-07-08T11:42:09.208095+00:00, externally triggered: True> failed
[2021-07-08 11:44:26,873] {kubernetes_executor.py:202} ERROR - Event: ascheckcountssfvspgcontentdb.44eee838b6754b0aa0c8e6288a6195a2 Failed

Log of the task:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1137, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
    result = task_copy.execute(context=context)
  File "/usr/local/airflow/plugins/operators/jw_pg_vs_sf_count_check.py", line 65, in execute
    cur_pg.execute(sql)
  File "/usr/local/lib/python3.8/encodings/utf_8.py", line 15, in decode
    def decode(input, errors='strict'):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1266, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2021-07-08 13:16:42,544] {taskinstance.py:1524} INFO - Marking task as FAILED. dag_id=as_check_counts_sf_vs_pg, task_id=contentdb, execution_date=20210708T114209, start_date=20210708T131622, end_date=20210708T131642
[2021-07-08 13:16:42,601] {process_utils.py:66} INFO - Process psutil.Process(pid=149, status='terminated', exitcode=1, started='13:16:21') (149) terminated with exit code 1

The UI: [screenshot]

Neither the previous logs nor these show any error related to the DB.

