State of this instance has been externally set to up_for_retry. Terminating instance.
Apache Airflow version: 2.0.1
Kubernetes version (if you are using kubernetes) (use `kubectl version`): 1.18.14
Environment:
- Cloud provider or hardware configuration: Azure
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
What happened:
Occasionally, an Airflow task fails with the following error:
[2021-06-21 05:39:48,424] {local_task_job.py:184} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2021-06-21 05:39:48,425] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 259
[2021-06-21 05:39:48,426] {taskinstance.py:1238} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-06-21 05:39:48,426] {bash.py:185} INFO - Sending SIGTERM signal to bash process group
[2021-06-21 05:39:49,133] {process_utils.py:66} INFO - Process psutil.Process(pid=329, status='terminated', started='04:32:14') (329) terminated with exit code None
[2021-06-21 05:39:50,278] {taskinstance.py:1454} ERROR - Task received SIGTERM signal
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1284, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1309, in _execute_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/bash.py", line 171, in execute
for raw_line in iter(self.sub_process.stdout.readline, b''):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1240, in signal_handler
raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
There is no indication as to what caused this error. The worker instance is healthy and the task did not hit its timeout.
What you expected to happen:
The task to complete successfully. If a task has to fail for an unavoidable reason (such as a timeout), it would be helpful to provide the reason for the failure.
How to reproduce it:
I’m not able to reproduce it consistently. It happens every now and then, with the same error as shown above.
I would also like to know how to debug these failures.
Top GitHub Comments
After more investigation, I was able to get to the bottom of it. I had missed the actual SQLAlchemy queries updating the task instance/task to failed, as they are performed neither by the scheduler, the web server, nor the worker, but by the DAG processor (as part of file-level callbacks), whose logs are only available locally in my setup.
In my case, the scheduler thinks the job is a zombie. I had also missed that, as the log message is at INFO level and includes neither the task ID nor the DAG ID. Maybe the representation of the task instance should be used to provide more context and make searching logs easier?

The scheduler is marking the job as failed in `adopt_or_reset_orphaned_tasks`, which marks as failed all jobs that have not sent a heartbeat in the last `scheduler_health_check_threshold=30s`.
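For illustration, here is a minimal sketch of the kind of check described above (a simplified model, not Airflow's actual `adopt_or_reset_orphaned_tasks` implementation): any job whose `latest_heartbeat` is older than `scheduler_health_check_threshold` is treated as orphaned and marked failed. The `Job` class and field names are simplified assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

# Threshold mentioned above (scheduler_health_check_threshold = 30s).
SCHEDULER_HEALTH_CHECK_THRESHOLD = timedelta(seconds=30)

@dataclass
class Job:
    """Simplified stand-in for a scheduler job record; not Airflow's actual model."""
    job_id: int
    latest_heartbeat: datetime
    state: str = "running"

def reset_orphaned_jobs(jobs: List[Job], now: datetime) -> List[Job]:
    """Mark every job whose heartbeat is older than the threshold as failed."""
    orphaned = []
    for job in jobs:
        if now - job.latest_heartbeat > SCHEDULER_HEALTH_CHECK_THRESHOLD:
            job.state = "failed"  # the associated task instance then gets retried
            orphaned.append(job)
    return orphaned

# A heartbeat recorded 31s ago is already past the 30s threshold; one from 5s ago is not.
now = datetime.now(timezone.utc)
jobs = [Job(1, now - timedelta(seconds=31)), Job(2, now - timedelta(seconds=5))]
print([j.job_id for j in reset_orphaned_jobs(jobs, now)])  # -> [1]
```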
This is apparently caused by a combination of two facts:
- I had set `job_heartbeat_sec` to 30s, to avoid putting too much pressure on the database, as my jobs are long.
- Whenever the job heartbeats, it sets `latest_heartbeat` to 30s in the past, as shown in the following database logs. I am not sure if this is on purpose or a bug, but it certainly looks suspicious (a toy model of this behaviour is sketched below).
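A minimal Python model of the heartbeat behaviour described in the second point, assuming (based on my observation above, not on Airflow's source) that each heartbeat stores `latest_heartbeat` as the current time minus `job_heartbeat_sec`:

```python
from datetime import datetime, timedelta, timezone

# Value from my configuration; the 30s lag itself is an assumption inferred from the logs.
JOB_HEARTBEAT_SEC = timedelta(seconds=30)

def record_heartbeat(now: datetime) -> datetime:
    """Return the stored latest_heartbeat as observed: lagging the wall clock by job_heartbeat_sec."""
    return now - JOB_HEARTBEAT_SEC

now = datetime.now(timezone.utc)
latest_heartbeat = record_heartbeat(now)
print(now - latest_heartbeat)  # 0:00:30 -- already at the 30s health-check threshold
```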
In this example, `adopt_or_reset_orphaned_tasks` then ran 15s after the last heartbeat and did:

From my limited understanding, there seem to be two issues:
- `latest_heartbeat` is set to 30s in the past instead of to the current time point, as I would expect.
- Although setting `job_heartbeat_sec=30s` makes things worse by increasing the odds of this occurring, it seems like this can occur as soon as `job_heartbeat_sec > 15s`. Generally, it seems odd that `job_heartbeat_sec` is not used in `adopt_or_reset_orphaned_tasks` instead of `scheduler_health_check_threshold`. In particular, if `job_heartbeat_sec` is much larger than `scheduler_health_check_threshold`, then won’t `adopt_or_reset_orphaned_tasks` fail most jobs? (A rough worked example is sketched below.)
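A rough worked example of the second issue, under the same assumptions as the sketches above: `latest_heartbeat` lags by `job_heartbeat_sec`, and the next heartbeat only arrives `job_heartbeat_sec` later, so the apparent heartbeat age can reach roughly twice `job_heartbeat_sec` just before that next heartbeat. The 30s check then fires as soon as `job_heartbeat_sec > 15s`. This illustrates my reading of the behaviour, not Airflow's actual code:

```python
from datetime import timedelta

SCHEDULER_HEALTH_CHECK_THRESHOLD = timedelta(seconds=30)

def worst_case_heartbeat_age(job_heartbeat_sec: int) -> timedelta:
    # Assumption carried over from the sketches above: latest_heartbeat is stored
    # job_heartbeat_sec in the past, and the next heartbeat only arrives
    # job_heartbeat_sec later, so the apparent age peaks at roughly twice that.
    return timedelta(seconds=2 * job_heartbeat_sec)

for seconds in (5, 15, 16, 30):
    age = worst_case_heartbeat_age(seconds)
    flagged = age > SCHEDULER_HEALTH_CHECK_THRESHOLD
    print(f"job_heartbeat_sec={seconds:>2}s  worst-case apparent age={age}  marked failed={flagged}")
# The check starts firing once job_heartbeat_sec exceeds 15s:
# 5s and 15s stay within the 30s threshold, 16s and 30s do not.
```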
Just found a failed task with exactly this message. Airflow 2.1.3. First task attempt, end of log:
Second attempt log (full):