Tasks intermittently get terminated with SIGTERM on K8s executor
See original GitHub issue

Apache Airflow version: 2.0.1
Kubernetes version (if you are using kubernetes) (use kubectl version): 1.18.10
Environment:
- Cloud provider or hardware configuration: Azure Kubernetes Service
- OS (e.g. from /etc/os-release): Debian GNU/Linux 10
- Kernel (e.g. uname -a): Linux airflow-k8s-scheduler-blabla-xvv5f 5.4.0-1039-azure #41~18.04.1-Ubuntu SMP Mon Jan 18 14:00:01 UTC 2021 x86_64 GNU/Linux
- Install tools:
- Others:
What happened: Tasks keep failing with SIGTERM.
The task fails due to SIGTERM before it can finish:
[2021-03-08 20:39:14,503] {taskinstance.py:570} DEBUG - Refreshing TaskInstance <TaskInstance: xxxxx_xxxxxx_xxx 2021-03-08T19:20:00+00:00 [running]> from DB
[2021-03-08 20:39:14,878] {taskinstance.py:605} DEBUG - Refreshed TaskInstance <TaskInstance: xxxxx_xxxxxx_xxx 2021-03-08T19:20:00+00:00 [running]>
[2021-03-08 20:39:14,880] {base_job.py:219} DEBUG - [heartbeat]
[2021-03-08 20:39:21,241] {taskinstance.py:570} DEBUG - Refreshing TaskInstance <TaskInstance: xxxxx_xxxxxx_xxx 2021-03-08T19:20:00+00:00 [running]> from DB
[2021-03-08 20:39:21,670] {taskinstance.py:605} DEBUG - Refreshed TaskInstance <TaskInstance: xxxxx_xxxxxx_xxx 2021-03-08T19:20:00+00:00 [running]>
[2021-03-08 20:39:21,672] {base_job.py:219} DEBUG - [heartbeat]
[2021-03-08 20:39:27,736] {taskinstance.py:570} DEBUG - Refreshing TaskInstance <TaskInstance: xxxxx_xxxxxx_xxx 2021-03-08T19:20:00+00:00 [running]> from DB
[2021-03-08 20:39:28,360] {taskinstance.py:605} DEBUG - Refreshed TaskInstance <TaskInstance: xxxxx_xxxxxx_xxx 2021-03-08T19:20:00+00:00 [None]>
[2021-03-08 20:39:28,362] {local_task_job.py:188} WARNING - State of this instance has been externally set to None. Terminating instance.
[2021-03-08 20:39:28,363] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 21
[2021-03-08 20:39:50,119] {taskinstance.py:1239} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-03-08 20:39:50,119] {taskinstance.py:570} DEBUG - Refreshing TaskInstance <TaskInstance: xxxxx_xxxxxx_xxx 2021-03-08T19:20:00+00:00 [running]> from DB
[2021-03-08 20:39:51,328] {taskinstance.py:605} DEBUG - Refreshed TaskInstance <TaskInstance: xxxxx_xxxxxx_xxx 2021-03-08T19:20:00+00:00 [queued]>
[2021-03-08 20:39:51,329] {taskinstance.py:1455} ERROR - Task received SIGTERM signal
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1285, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1315, in _execute_task
    result = task_copy.execute(context=context)
  File "/opt/airflow/mnt/dags/xyz/operators/xyz_data_sync_operator.py", line 73, in execute
    XyzUpsertClass(self.dest_conn_id, self.table, self.schema, src_res, drop_missing_columns=self.drop_missing_columns).execute()
  File "/opt/airflow/mnt/dags/xyz/operators/xyz_data_sync_operator.py", line 271, in execute
    cursor.execute(f"UPDATE {self.schema}.{self.table} SET {up_template} WHERE {pk_template}", row)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1241, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
[2021-03-08 20:39:51,409] {taskinstance.py:1862} DEBUG - Task Duration set to 1055.564798
[2021-03-08 20:39:51,410] {taskinstance.py:1503} INFO - Marking task as FAILED. dag_id=xyz_sync task_id=xxxxx, execution_date=20210308T192000, start_date=20210308T202215, end_date=20210308T203951

This also happens with a successful task:
[2021-03-08 21:46:01,466] {__init__.py:124} DEBUG - Preparing lineage inlets and outlets
[2021-03-08 21:46:01,466] {__init__.py:168} DEBUG - inlets: [], outlets: []
[2021-03-08 21:46:01,467] {logging_mixin.py:104} INFO - {'conf': <airflow.configuration.AirflowConfigParser object at 0x7fc0f7f2a910>, 'dag': <DAG: canary_dag>, 'dag_run': <DagRun canary_dag @ 2021-03-08 20:10:00+00:00: scheduled__2021-03-08T20:10:00+00:00, externally triggered: False>, 'ds_nodash': '20210308', 'execution_date': DateTime(2021, 3, 8, 20, 10, 0, tzinfo=Timezone('+00:00')), 'inlets': [], 'macros': <module 'airflow.macros' from '/home/airflow/.local/lib/python3.7/site-packages/airflow/macros/__init__.py'>, 'next_ds': '2021-03-08', 'next_ds_nodash': '20210308', 'next_execution_date': DateTime(2021, 3, 8, 21, 10, 0, tzinfo=Timezone('UTC')), 'outlets': [], 'params': {}, 'prev_ds': '2021-03-08', 'prev_ds_nodash': '20210308', 'prev_execution_date': DateTime(2021, 3, 8, 19, 10, 0, tzinfo=Timezone('UTC')), 'prev_execution_date_success': <Proxy at 0x7fc0e3d0d780 with factory <function TaskInstance.get_template_context.<locals>.<lambda> at 0x7fc0e3d2ad40>>, 'prev_start_date_success': <Proxy at 0x7fc0e3d0d7d0 with factory <function TaskInstance.get_template_context.<locals>.<lambda> at 0x7fc0e3ff8830>>, 'run_id': 'scheduled__2021-03-08T20:10:00+00:00', 'task': <Task(PythonOperator): print_the_context>, 'task_instance': <TaskInstance: canary_dag.print_the_context 2021-03-08T20:10:00+00:00 [running]>, 'task_instance_key_str': 'canary_dag__print_the_context__20210308', 'test_mode': False, 'ti': <TaskInstance: canary_dag.print_the_context 2021-03-08T20:10:00+00:00 [running]>, 'tomorrow_ds': '2021-03-09', 'tomorrow_ds_nodash': '20210309', 'ts': '2021-03-08T20:10:00+00:00', 'ts_nodash': '20210308T201000', 'ts_nodash_with_tz': '20210308T201000+0000', 'var': {'json': None, 'value': None}, 'yesterday_ds': '2021-03-07', 'yesterday_ds_nodash': '20210307', 'templates_dict': None}
[2021-03-08 21:46:01,467] {logging_mixin.py:104} INFO - 2021-03-08
[2021-03-08 21:46:01,467] {python.py:118} INFO - Done. Returned value was: Whatever you return gets printed in the logs
[2021-03-08 21:46:21,690] {__init__.py:88} DEBUG - Lineage called with inlets: [], outlets: []
[2021-03-08 21:46:21,691] {taskinstance.py:570} DEBUG - Refreshing TaskInstance <TaskInstance: canary_dag.print_the_context 2021-03-08T20:10:00+00:00 [running]> from DB
[2021-03-08 21:46:32,042] {taskinstance.py:605} DEBUG - Refreshed TaskInstance <TaskInstance: canary_dag.print_the_context 2021-03-08T20:10:00+00:00 [running]>
[2021-03-08 21:46:32,051] {taskinstance.py:1166} INFO - Marking task as SUCCESS. dag_id=canary_dag, task_id=print_the_context, execution_date=20210308T201000, start_date=20210308T214431, end_date=20210308T214632
[2021-03-08 21:46:32,051] {taskinstance.py:1862} DEBUG - Task Duration set to 120.361368
[2021-03-08 21:46:38,577] {taskinstance.py:570} DEBUG - Refreshing TaskInstance <TaskInstance: canary_dag.print_the_context 2021-03-08T20:10:00+00:00 [running]> from DB
[2021-03-08 21:46:51,150] {taskinstance.py:605} DEBUG - Refreshed TaskInstance <TaskInstance: canary_dag.print_the_context 2021-03-08T20:10:00+00:00 [success]>
[2021-03-08 21:46:51,152] {local_task_job.py:188} WARNING - State of this instance has been externally set to success. Terminating instance.
[2021-03-08 21:46:51,153] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 22
[2021-03-08 21:46:59,668] {taskinstance.py:1239} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-03-08 21:46:59,669] {cli_action_loggers.py:84} DEBUG - Calling callbacks: []
[2021-03-08 21:46:59,757] {process_utils.py:66} INFO - Process psutil.Process(pid=22, status='terminated', exitcode=1, started='21:44:41') (22) terminated with exit code 1
[2021-03-08 21:46:59,758] {base_job.py:219} DEBUG - [heartbeat]

What you expected to happen: Tasks to be marked correctly instead of receiving SIGTERM.
How to reproduce it: It mostly occurs when a large number of DAGs (100+) are running on AKS.
Anything else we need to know: DAGs that run longer, or that have more than 10-12 tasks (which may run long), seem to have a higher probability of hitting this.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
tl;dr: set schedule_after_task_execution to False, either by updating your airflow.cfg (recommended) or by setting the environment variable AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION to False.
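Concretely, that means something like this (the [scheduler] section and the AIRFLOW__{SECTION}__{KEY} environment-variable mapping are Airflow's standard convention):

```
# In airflow.cfg (recommended):
[scheduler]
schedule_after_task_execution = False

# Or as an environment variable on the scheduler/worker pods:
# AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False
```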
We had the same issue, and with the help of Sentry to look through the whole stack trace, I found out why.
The buggy code block is taskinstance.py#L1182-L1201:
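Roughly, it does the following (a paraphrased sketch of that block, not the exact 2.0.1 source; only the ordering matters here):

```python
# Paraphrased sketch of the tail of TaskInstance._run_raw_task in Airflow 2.0.1
# (taskinstance.py#L1182-L1201) -- not the exact source.
self.state = State.SUCCESS
session.merge(self)
session.commit()          # SUCCESS is now visible to the LocalTaskJob heartbeat

if not test_mode:
    # ... and only now does the potentially expensive "mini scheduler" run,
    # while the parent job already sees a finished task in the DB.
    self._run_mini_scheduler_on_child_tasks(session)
```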
Looking at the code: after marking a task SUCCESS and committing, if it is not test mode, it calls the potentially expensive function _run_mini_scheduler_on_child_tasks. Meanwhile, local_task_job.py#L179-L199 will detect the SUCCESS state very soon and, since the task is no longer "running", terminate the process, which may still be executing _run_mini_scheduler_on_child_tasks:
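The heartbeat check it races against looks roughly like this (again a paraphrased sketch of LocalTaskJob.heartbeat_callback, not the exact source):

```python
# Paraphrased sketch of LocalTaskJob.heartbeat_callback (local_task_job.py#L179-L199)
# in Airflow 2.0.1 -- not the exact source. `ti` is the TaskInstance just refreshed from the DB.
if ti.state == State.RUNNING:
    pass  # normal case: verify hostname / pid and keep heartbeating
elif self.task_runner.return_code() is None:
    # The task process has not exited, yet the DB no longer says RUNNING:
    # either the state was changed externally, or (as here) the raw task has
    # already committed SUCCESS and is still busy in
    # _run_mini_scheduler_on_child_tasks. Either way the child gets killed.
    self.log.warning(
        "State of this instance has been externally set to %s. Terminating instance.",
        ti.state,
    )
    self.task_runner.terminate()  # this is what sends the SIGTERM seen in the logs
```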
This is borne out by the log @saurabhladhe shared above (the line numbers diverge a bit because that log comes from Airflow 2.0.1).
So the mitigation is to avoid the expensive _run_mini_scheduler_on_child_tasks call; it is an optimization controlled by schedule_after_task_execution and can simply be disabled.
The proper fix would start from reordering that block so the final task state is only committed after _run_mini_scheduler_on_child_tasks has run. However, _run_mini_scheduler_on_child_tasks might get an OperationalError and roll back the session completely (taskinstance.py#L1248), and then we would falsely mark the task as failed even though only the optimization failed. So I'd leave it to someone who knows the code better to fix it properly. Personally I'd suggest removing this optimization completely.
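To make that hazard concrete, a hypothetical reordering (illustration only; not actual Airflow code) might look like:

```python
# Hypothetical reordering, for illustration only -- not actual Airflow code.
# Mark SUCCESS but delay the commit until after the mini scheduler, so the
# LocalTaskJob heartbeat never sees a terminal state while work is in progress.
self.state = State.SUCCESS
session.merge(self)        # SUCCESS is pending, but not yet committed

if not test_mode:
    # Danger: an OperationalError here rolls the session back completely
    # (taskinstance.py#L1248), discarding the pending SUCCESS above, so the
    # task would be falsely recorded as failed even though only the
    # optimization failed -- the objection raised in this comment.
    self._run_mini_scheduler_on_child_tasks(session)

session.commit()
```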
I can confirm it solved our issues.