
Re-deploy scheduler tasks failing with SIGTERM on K8s executor

Apache Airflow version: 2.1.0

Kubernetes version (if you are using kubernetes) (use kubectl version): v1.18.17-gke.1901

Environment:

  • Cloud provider or hardware configuration: Google Cloud
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 10 (buster)
  • Kernel (e.g. uname -a): Linux airflow-scheduler-7697b66974-m6mct 5.4.89+ #1 SMP Sat Feb 13 19:45:14 PST 2021 x86_64 GNU/Linux
  • Install tools:
  • Others:

What happened: When the scheduler is restarted, the currently running tasks fail with a SIGTERM error. Every time the scheduler is restarted or re-deployed, the old scheduler is terminated and a new one is created. If tasks are running during this process, the new scheduler terminates them with a completed status, and new tasks are created to continue the work of the terminated ones. After a few seconds, these new tasks are terminated with an error status and a SIGTERM error.

Error log:

[2021-07-07 14:59:49,024] {cursor.py:661} INFO - query execution done
[2021-07-07 14:59:49,025] {arrow_result.pyx:0} INFO - fetching data done
[2021-07-07 15:00:07,361] {local_task_job.py:196} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2021-07-07 15:00:07,363] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 150
[2021-07-07 15:00:12,845] {taskinstance.py:1264} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-07-07 15:00:12,907] {process_utils.py:66} INFO - Process psutil.Process(pid=150, status='terminated', exitcode=0, started='14:59:46') (150) terminated with exit code 0

What you expected to happen:

The currently running tasks should be allowed to finish their work, or the substitute tasks should complete successfully. The new scheduler should not interfere with the running tasks.

How to reproduce it:

To reproduce, start a DAG that has one or more tasks that take a few minutes to complete. While these tasks are running, deploy a new scheduler. During the re-deploy, the current scheduler is terminated and a new one is created. The running tasks are marked as completed (without finishing their work) and are substituted by new ones that fail within seconds.
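
For illustration, a minimal DAG along the lines below is enough to open the reproduction window. The DAG id, task id, and sleep duration are placeholders, not names from the original report:

# Hypothetical reproduction DAG: a single task that sleeps for several minutes,
# leaving time to re-deploy the scheduler while the task is still running.
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def long_running_task():
    # Simulate a query or job that needs a few minutes to finish.
    time.sleep(10 * 60)


with DAG(
    dag_id="sigterm_repro",           # placeholder name
    start_date=datetime(2021, 7, 1),
    schedule_interval=None,           # trigger manually
    catchup=False,
) as dag:
    PythonOperator(
        task_id="sleep_ten_minutes",
        python_callable=long_running_task,
    )

Trigger the DAG manually and, while the task is still sleeping, roll out a new scheduler deployment (for example with kubectl rollout restart deployment/airflow-scheduler, substituting your own deployment name).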

Anything else we need to know:

The problem was not happening with Airflow 1.10.15; it started happening after the upgrade to Airflow 2.1.0.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 11 (2 by maintainers)

Top GitHub Comments

2 reactions
dberzano commented, Sep 3, 2021

OK, let me retract - our problem was, it seems, different. The service account provided to the scheduler was not bound to any role containing the patch verb for the pods resource… so the issue is solved for us.

@rodrigo-morais maybe it was the same issue for you, not sure. Apologies in advance for the noise, but let me share in case it helps you - In our case, the scheduler’s error message contained the following failed attempt to adopt pods on the new scheduler’s startup:

[2021-09-03 13:31:57,269] {kubernetes_executor.py:663} INFO - attempting to adopt pod verylongpodname.37909dcabdcfe4598967b725b12ef92c
[2021-09-03 13:31:57,278] {kubernetes_executor.py:681} INFO - Failed to adopt pod verylongpodname.37909dcabdcfe4598967b725b12ef92c. Reason: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'aaaaaaaa-aaaa-aaaa-1111-bbbbbbbbbbbb', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Fri, 03 Sep 2021 13:31:57 GMT', 'Content-Length': '501'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"verylongpodname.37909dcabdcfe4598967b725b12ef92c\" is forbidden: User \"system:serviceaccount:ournamespace:ourserviceaccount\" cannot patch resource \"pods\" in API group \"\" in the namespace \"ournamespace\"","reason":"Forbidden","details":{"name":"verylongpodname.37909dcabdcfe4598967b725b12ef92c","kind":"pods"},"code":403}

We’ve edited the Role resource (check with kubectl get role) by adding the patch verb for the pods resource, which was missing, exactly as the error message said.
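
As a rough sketch (the Role name below is a placeholder and the verb list is just a typical set for pod launching, not taken from our cluster), the Role bound to the scheduler's service account needs a rule on pods that includes patch:

# Hypothetical Role for the scheduler's service account; the key point for this
# issue is that "patch" is included in the verbs for the "pods" resource.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: airflow-pod-manager        # placeholder name
  namespace: ournamespace
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "delete", "patch"]

A quick way to confirm whether the verb is missing is kubectl auth can-i patch pods -n ournamespace --as=system:serviceaccount:ournamespace:ourserviceaccount.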

1 reaction
rodrigo-morais commented, Sep 3, 2021

Pods before deploying a new scheduler

NAME                                                             READY   STATUS    RESTARTS   AGE
airflow-scheduler-5c989d8478-kvllt                               1/1     Running   0          3m57s
airflow-web-74f56b9df-hvpzz                                      2/2     Running   0          3m57s
ascheckcountssfvspgbusinessdb.89bf993a26f3431fa07667de82954239   1/1     Running   0          40s
ascheckcountssfvspgcontentdb.8cf4f8e0ae794b3dbc9b471c9aef4532    1/1     Running   0          40s
dbt-docs-647db48b7d-vpj4p                                        1/1     Running   0          3m57s

Pods after deploying the new scheduler

NAME                                                             READY   STATUS              RESTARTS   AGE
airflow-scheduler-5c989d8478-kvllt                               1/1     Running             0          4m6s
airflow-scheduler-776f94956b-s4dfv                               0/1     ContainerCreating   0          5s
airflow-web-5c667d85db-kfcx6                                     0/2     ContainerCreating   0          2s
airflow-web-74f56b9df-hvpzz                                      2/2     Running             0          4m6s
ascheckcountssfvspgbusinessdb.89bf993a26f3431fa07667de82954239   1/1     Running             0          49s
ascheckcountssfvspgcontentdb.8cf4f8e0ae794b3dbc9b471c9aef4532    1/1     Running             0          49s
dbt-docs-647db48b7d-vpj4p                                        1/1     Running             0          4m6s

Pods after the old scheduler has been terminated and the new one has been created

NAME                                                                       READY   STATUS      RESTARTS   AGE
airflow-scheduler-776f94956b-s4dfv                                         1/1     Running     0          3m6s
airflow-web-5c667d85db-kfcx6                                               2/2     Running     0          3m3s
ascheckcountssfvspgcontentdb.44eee838b6754b0aa0c8e6288a6195a2              0/1     Error       0          2m16s
ascheckcountssfvspgcontentdb.8cf4f8e0ae794b3dbc9b471c9aef4532              0/1     Completed   0          3m50s
dbt-docs-647db48b7d-vpj4p                                                  1/1     Running     0          7m7s

Pod with error:
Running <TaskInstance: as_check_counts_sf_vs_pg.contentdb 2021-07-08T11:42:09.208095+00:00 [queued]> on host ascheckcountssfvspgcontentdb.44eee838b6754b0aa0c8e6288a6195a2
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/__main__.py", line 40, in main
    args.func(args)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/cli.py", line 91, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 237, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 64, in _run_task_by_selected_method
    _run_task_by_local_task_job(args, ti)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/task_command.py", line 120, in _run_task_by_local_task_job
    run_job.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 237, in run
    self._execute()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 133, in _execute
    self.heartbeat()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 218, in heartbeat
    self.heartbeat_callback(session=session)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/session.py", line 67, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/local_task_job.py", line 207, in heartbeat_callback
    ti._run_finished_callback(error=error)  # pylint: disable=protected-access
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1367, in _run_finished_callback
    task.on_failure_callback(context)

Pod with scheduler:
/home/airflow/.local/lib/python3.8/site-packages/airflow/configuration.py:850: DeprecationWarning: Specifying both AIRFLOW_HOME environment variable and airflow_home in the config file is deprecated. Please use only the AIRFLOW_HOME environment variable and remove the config file entry.
  warnings.warn(msg, category=DeprecationWarning)
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
[2021-07-08 11:44:16,768] {dagrun.py:429} ERROR - Marking run <DagRun as_check_counts_sf_vs_pg @ 2021-07-08 11:42:09.208095+00:00: manual__2021-07-08T11:42:09.208095+00:00, externally triggered: True> failed
[2021-07-08 11:44:26,873] {kubernetes_executor.py:202} ERROR - Event: ascheckcountssfvspgcontentdb.44eee838b6754b0aa0c8e6288a6195a2 Failed

Log of the task:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1137, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1341, in _execute_task
    result = task_copy.execute(context=context)
  File "/usr/local/airflow/plugins/operators/jw_pg_vs_sf_count_check.py", line 65, in execute
    cur_pg.execute(sql)
  File "/usr/local/lib/python3.8/encodings/utf_8.py", line 15, in decode
    def decode(input, errors='strict'):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1266, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2021-07-08 13:16:42,544] {taskinstance.py:1524} INFO - Marking task as FAILED. dag_id=as_check_counts_sf_vs_pg, task_id=contentdb, execution_date=20210708T114209, start_date=20210708T131622, end_date=20210708T131642
[2021-07-08 13:16:42,601] {process_utils.py:66} INFO - Process psutil.Process(pid=149, status='terminated', exitcode=1, started='13:16:21') (149) terminated with exit code 1

The UI: [screenshot]

Neither the previous logs nor these show any error related to the DB.

