
State of this instance has been externally set to up_for_retry. Terminating instance.

See original GitHub issue

Apache Airflow version: 2.0.1

Kubernetes version (if you are using kubernetes) (use kubectl version): 1.18.14

Environment:

  • Cloud provider or hardware configuration: Azure
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

What happened:

Occasionally, an Airflow task fails with the following error:

[2021-06-21 05:39:48,424] {local_task_job.py:184} WARNING - State of this instance has been externally set to up_for_retry. Terminating instance.
[2021-06-21 05:39:48,425] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 259
[2021-06-21 05:39:48,426] {taskinstance.py:1238} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-06-21 05:39:48,426] {bash.py:185} INFO - Sending SIGTERM signal to bash process group
[2021-06-21 05:39:49,133] {process_utils.py:66} INFO - Process psutil.Process(pid=329, status='terminated', started='04:32:14') (329) terminated with exit code None
[2021-06-21 05:39:50,278] {taskinstance.py:1454} ERROR - Task received SIGTERM signal
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1112, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1284, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1309, in _execute_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/operators/bash.py", line 171, in execute
    for raw_line in iter(self.sub_process.stdout.readline, b''):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1240, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal

There is no indication as to what caused this error. The worker instance is healthy and the task did not hit its timeout.
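
For context on where the exception originates: the raw task process installs a SIGTERM handler that converts the signal into an AirflowException, which is the frame at taskinstance.py:1240 in the traceback above. A minimal sketch of that pattern (an illustration of the mechanism, not the actual Airflow source):

    import signal

    from airflow.exceptions import AirflowException

    def signal_handler(signum, frame):
        # Corresponds to taskinstance.py:1240 in the traceback: the task process
        # turns SIGTERM into an AirflowException so the failure is recorded.
        raise AirflowException("Task received SIGTERM signal")

    signal.signal(signal.SIGTERM, signal_handler)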

What you expected to happen:

The task to complete successfully. If a task had to fail for an unavoidable reason (such as a timeout), it would be helpful to surface the reason for the failure.

How to reproduce it:

I’m not able to reproduce it consistently. It happens every now and then with the same error as provided above.

I would also like to know how to debug these failures.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

3 reactions
peay commented, Jul 27, 2021

After more investigation, I was able to get to the bottom of it. I had missed the SQLAlchemy queries that actually update the task instance/job state, as they are performed not by the scheduler, web server, or worker, but by the DAG processor (as part of file-level callbacks), whose logs are only available locally in my setup.

In my case, the scheduler thinks the job is a zombie. I had also missed that, as the log message is at INFO level and includes neither the task ID nor the DAG ID. Maybe the representation of the task instance should be used to provide more context and make searching logs easier?

[2021-07-27 09:54:44,249] {dag_processing.py:1156} INFO
   Detected zombie job:
   {'full_filepath': '/opt/dags/dags.py', 'msg': 'Detected as zombie', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f7362fe5510>, 'is_failure_callback': True}
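
A minimal sketch of what that suggestion could look like (a hypothetical helper; the zombie dict layout mirrors the log entry above, and SimpleTaskInstance does expose dag_id and task_id in Airflow 2.x):

    import logging

    log = logging.getLogger(__name__)

    # Hypothetical variant of the zombie log line that surfaces dag_id/task_id
    # instead of the bare SimpleTaskInstance repr. `zombie` mimics the dict shown
    # above: it has 'full_filepath' and 'simple_task_instance' keys.
    def log_zombie(zombie: dict) -> None:
        sti = zombie["simple_task_instance"]
        log.info(
            "Detected zombie job: dag_id=%s, task_id=%s, file=%s",
            sti.dag_id,
            sti.task_id,
            zombie["full_filepath"],
        )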

The scheduler is marking the job as failed in adopt_or_reset_orphaned_tasks, which marks as failed all jobs that have not sent a heartbeat in the last scheduler_health_check_threshold=30s.

This is apparently caused by a combination of two facts:

  • I had set job_heartbeat_sec to 30s, to avoid too much pressure on the database as my jobs are long.

  • Whenever the job heartbeats, it sets latest_heartbeat to 30s in the past, as shown in the following database logs. I am not sure if this is on purpose or a bug, but it certainly looks suspicious.

    2021-07-27 09:53:21 UTC  UPDATE job SET latest_heartbeat='2021-07-27T09:52:51.398617+00:00'::timestamptz WHERE job.id = 4451
    2021-07-27 09:53:51 UTC  UPDATE job SET latest_heartbeat='2021-07-27T09:53:21.480933+00:00'::timestamptz WHERE job.id = 4451
    2021-07-27 09:54:21 UTC  UPDATE job SET latest_heartbeat='2021-07-27T09:53:51.566774+00:00'::timestamptz WHERE job.id = 4451
    

In this example, adopt_or_reset_orphaned_tasks then ran 15s after the last heartbeat and did:

2021-07-27 09:54:36 UTC  UPDATE job SET state='failed' WHERE job.state = 'running' AND job.latest_heartbeat < '2021-07-27T09:54:06.555688+00:00'::timestamptz
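
To check whether running jobs in your own deployment show the same backdating, a small diagnostic against the metadata database can help (a hedged sketch; BaseJob, create_session, and timezone.utcnow are standard Airflow 2.x APIs, but only run this where you are comfortable querying the metadata DB directly):

    # Hedged diagnostic: print how stale each running job's latest_heartbeat looks.
    from airflow.jobs.base_job import BaseJob
    from airflow.utils import timezone
    from airflow.utils.session import create_session

    with create_session() as session:
        for job in session.query(BaseJob).filter(BaseJob.state == "running"):
            lag = (timezone.utcnow() - job.latest_heartbeat).total_seconds()
            print(f"job {job.id} ({job.job_type}): heartbeat lag {lag:.0f}s")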

From my limited understanding, there seem to be two issues:

  • latest_heartbeat is set to 30s in the past instead of the current time point, as I would expect.

  • Although setting job_heartbeat_sec=30s makes things worse by increasing the odds of this occurring, it seems like this can occur as soon as job_heartbeat_sec > 15s. Generally, it seems odd that job_heartbeat_sec is not used in adopt_or_reset_orphaned_tasks instead of scheduler_health_check_threshold. In particular, if job_heartbeat_sec is much larger than scheduler_health_check_threshold, then won’t adopt_or_reset_orphaned_tasks fail most jobs?
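
The arithmetic behind that last point, as a quick sketch (the 15 s gap is taken from the logged example above; the backdating assumption comes from the database logs):

    # Worked example of the timing in the list above, assuming heartbeats are
    # backdated by job_heartbeat_sec (as the database logs show).
    job_heartbeat_sec = 30
    scheduler_health_check_threshold = 30

    # In the logged example, adopt_or_reset_orphaned_tasks ran 15s after the last
    # heartbeat UPDATE, so the stored latest_heartbeat looked 30 + 15 = 45s old.
    staleness_when_check_ran = job_heartbeat_sec + 15
    print(staleness_when_check_ran > scheduler_health_check_threshold)  # True -> job marked failed

    # Worst case, just before the next heartbeat is written, the stored value looks
    # 2 * job_heartbeat_sec old, so any job_heartbeat_sec > 15 can trip the 30s threshold.
    worst_case_staleness = 2 * job_heartbeat_sec
    print(worst_case_staleness > scheduler_health_check_threshold)  # True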

0 reactions
crazyproger commented, Sep 2, 2021

Just found a failed task with exactly this message, on Airflow 2.1.3. End of the log for the first task attempt:

...
[2021-09-02 15:38:18,542] {python.py:151} INFO - Done. Returned value was: None
[2021-09-02 15:38:18,554] {taskinstance.py:1218} INFO - Marking task as SUCCESS. dag_id=<dag_id>, task_id=process, execution_date=20210809T150000, start_date=20210902T153811, end_date=20210902T153818
[2021-09-02 15:38:18,641] {local_task_job.py:151} INFO - Task exited with return code 0
[2021-09-02 15:38:18,651] {taskinstance.py:1512} INFO - Marking task as FAILED. dag_id=<dag_id>, task_id=process, execution_date=20210809T150000, start_date=20210902T153818, end_date=20210902T153818
[2021-09-02 15:38:18,749] {local_task_job.py:261} INFO - 0 downstream tasks scheduled from follow-on schedule check

Second attempt log (full):

*** Log file does not exist: /opt/airflow/logs/ <dag_id>/process/2021-08-09T15:00:00+00:00/2.log
*** Fetching from: http://<worker_host>:8793/log/ <dag_id>/process/2021-08-09T15:00:00+00:00/2.log

[2021-09-02 15:38:18,574] {taskinstance.py:903} INFO - Dependencies all met for <TaskInstance: <dag_id>.process 2021-08-09T15:00:00+00:00 [queued]>
[2021-09-02 15:38:18,603] {taskinstance.py:903} INFO - Dependencies all met for <TaskInstance:  <dag_id>.process 2021-08-09T15:00:00+00:00 [queued]>
[2021-09-02 15:38:18,603] {taskinstance.py:1094} INFO - 
--------------------------------------------------------------------------------
[2021-09-02 15:38:18,603] {taskinstance.py:1095} INFO - Starting attempt 2 of 2
[2021-09-02 15:38:18,604] {taskinstance.py:1096} INFO - 
--------------------------------------------------------------------------------
[2021-09-02 15:38:18,619] {taskinstance.py:1114} INFO - Executing <Task(PythonOperator): process> on 2021-08-09T15:00:00+00:00
[2021-09-02 15:38:18,625] {standard_task_runner.py:52} INFO - Started process 18717 to run task
[2021-09-02 15:38:18,628] {standard_task_runner.py:76} INFO - Running: ['airflow', 'tasks', 'run', ' <dag_id>', 'process', '2021-08-09T15:00:00+00:00', '--job-id', '46046825', '--pool', 'default_pool', '--raw', '--subdir', 'DAGS_FOLDER/datalocker/upload_dags/inapps_hourly_6.py', '--cfg-path', '/tmp/tmpl8uwjzk1', '--error-file', '/tmp/tmpiq4kx8a0']
[2021-09-02 15:38:18,629] {standard_task_runner.py:77} INFO - Job 46046825: Subtask process
[2021-09-02 15:38:23,682] {local_task_job.py:209} WARNING - State of this instance has been externally set to failed. Terminating instance.
[2021-09-02 15:38:23,684] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 18717
[2021-09-02 15:38:27,059] {process_utils.py:66} INFO - Process psutil.Process(pid=18717, status='terminated', exitcode=1, started='15:38:18') (18717) terminated with exit code 1
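
One way to see which component changed the task's state in a case like this is to read the events Airflow records in its audit log table (a hedged diagnostic; it assumes direct access to the Airflow metadata database, and the audit log mainly records CLI/UI events, so scheduler-internal changes may not appear and the scheduler/DAG-processor logs from the earlier comment remain the next place to look):

    # Hedged diagnostic: list recorded events for the affected task, ordered by
    # time. The dag_id/task_id values are placeholders matching the redacted log.
    from airflow.models.log import Log
    from airflow.utils.session import create_session

    with create_session() as session:
        events = (
            session.query(Log)
            .filter(Log.dag_id == "<dag_id>", Log.task_id == "process")
            .order_by(Log.dttm)
            .all()
        )
        for event in events:
            print(event.dttm, event.event, event.owner)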
