Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Running tasks marked as 'orphaned' and killed by scheduler

See original GitHub issue

Apache Airflow version: 2.0.2, 2.1.0

Kubernetes version (if you are using kubernetes) (use kubectl version): Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: AWS EKS
  • Others: Helm chart 8.0.8 / 8.1.0; Executor: CeleryExecutor

What happened:

When a DAG is paused and long-running PythonOperator tasks are triggered manually (“Ignore All Deps” → “Run”), they fail with this error:

[2021-05-24 08:49:02,166] {logging_mixin.py:104} INFO - hi there, try 6, going to sleep for 15 secs
[2021-05-24 08:49:03,808] {local_task_job.py:188} WARNING - State of this instance has been externally set to None. Terminating instance.
[2021-05-24 08:49:03,810] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 172
[2021-05-24 08:49:03,812] {taskinstance.py:1265} ERROR - Received SIGTERM. Terminating subprocesses.

And the scheduler logs contain this message:

[2021-05-24 08:48:59,471] {scheduler_job.py:1854} INFO - Resetting orphaned tasks for active dag runs
[2021-05-24 08:48:59,485] {scheduler_job.py:1921} INFO - Reset the following 2 orphaned TaskInstances:
	<TaskInstance: timeout_testing.run_param_all 2021-05-23 13:46:13.840235+00:00 [running]>
	<TaskInstance: timeout_testing.sleep_well 2021-05-23 13:46:13.840235+00:00 [running]>
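
Connecting the two logs: the first excerpt comes from the local task job, which heartbeats periodically and compares the task’s state recorded in the metadata database with the state it expects; the second comes from the scheduler’s orphan sweep, which is what changed that recorded state. A minimal, self-contained sketch of the heartbeat side (placeholder names, not Airflow’s actual implementation):

import os
import signal
import time

RUNNING = "running"


def heartbeat(fetch_state, child_pid, interval=5):
    """Kill the child process if the recorded state no longer says 'running'.

    fetch_state is a placeholder callable standing in for a metadata-DB lookup.
    """
    while True:
        state = fetch_state()
        if state != RUNNING:
            # Something else (here, the scheduler's orphan reset) changed the state,
            # so the task is terminated even though it is alive and well.
            print(f"State of this instance has been externally set to {state}. "
                  "Terminating instance.")
            os.kill(child_pid, signal.SIGTERM)
            return
        time.sleep(interval)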

What you expected to happen:

These tasks are alive and well, and shouldn’t be killed 😃 Looks like something in reset_state_for_orphaned_tasks is wrongly marking running tasks as abandoned…
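
To make that suspicion concrete, here is a deliberately simplified sketch (hypothetical TaskRecord model and field names, not Airflow’s actual query) of how an orphan sweep that only trusts tasks owned by a live scheduler job can misclassify a healthy, manually started task on a paused DAG:

from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    task_id: str
    state: Optional[str]
    queued_by_job_id: Optional[int]  # None for a task started manually from the UI


def reset_orphans(tasks, alive_scheduler_job_ids):
    """Reset every running task that is not owned by a live scheduler job."""
    reset = []
    for ti in tasks:
        if ti.state == "running" and ti.queued_by_job_id not in alive_scheduler_job_ids:
            ti.state = None       # the external state change the task log complains about
            reset.append(ti)
    return reset


# A manual "Ignore All Deps" run was never queued by the scheduler, so it has no
# owning job id and gets swept up even though it is still happily running:
tasks = [TaskRecord("sleep_well", "running", queued_by_job_id=None)]
print(reset_orphans(tasks, alive_scheduler_job_ids={42}))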

How to reproduce it:

import os
import time
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# DAG id is derived from the file name, e.g. timeout_testing.py -> "timeout_testing"
dag = DAG(os.path.basename(__file__).replace('.py', ''),
          start_date=datetime(2021, 5, 11),
          schedule_interval=timedelta(days=1))


def sleep_tester(time_out, retries):
    for i in range(retries):
        print(f'hi there, try {i}, going to sleep for {time_out} secs')
        time.sleep(time_out)
        print("Aaah, good times, see ya soon")


sleeping = PythonOperator(task_id="sleep_well",
                          python_callable=sleep_tester,
                          op_kwargs={'time_out': 15, 'retries': 50},
                          dag=dag)

Create a DAG with the task above, verify it is paused, trigger a DAG run manually from the UI, then trigger the task manually (“Ignore All Deps” → “Run”). The task should fail after a few iterations of the sleep loop.
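
If you want to watch the state change while reproducing this, one option is to poll the metadata database with Airflow’s own models (this assumes an Airflow 2.0/2.1 environment with database access, e.g. run inside a scheduler or worker pod; the dag_id/task_id match the repro DAG above):

import time

from airflow.models import TaskInstance
from airflow.utils.session import create_session

while True:
    with create_session() as session:
        states = [
            ti.state
            for ti in session.query(TaskInstance).filter(
                TaskInstance.dag_id == "timeout_testing",
                TaskInstance.task_id == "sleep_well",
            )
        ]
    # Expect 'running' while the task sleeps, then None right before the SIGTERM
    # shows up in the task log.
    print(states)
    time.sleep(5)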

Anything else we need to know: This might only happen if the DAG was never unpaused (“ON”), though I couldn’t verify that.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 12 (11 by maintainers)

Top GitHub Comments

3 reactions
ephraimbuddy commented, Nov 23, 2021

Fixed in #19375, @vapiravfif can you test in 2.2.2 and reopen if it still happens

0 reactions
vapiravfif commented, Dec 1, 2021

Yep, not reproducing on 2.2.2, thank you!!


Top Results From Across the Web

  • Triggered tasks from web-ui are killed by scheduler because ...
    I expect that the manual run will not be killed by the scheduler as it thinks it is orphaned and thus that my...
  • Why are my Airflow tasks being “externally set to failed”?
    I guess that since the SchedulerJob was marked as failed, then the TaskInstance running my actual task was considered an orphan, ...
  • Zombie and Undead tasks in Airflow | by Brihati Jain - Medium
    For long running tasks, there is a greater probability that a task can be marked zombie especially with operators such as KubernetesPodOperator.
  • Windows 2008 Task Scheduler Problem - TechNet - Microsoft
    2 months ago, I created a task with the “hidden” property, thinking that the task is running in “hidden” mode, but it only...
  • Hive-on-Spark tasks never finish - Cloudera Community - 52565
    The tasks in question are all foreachAsync calls · All of the stalled tasks are running in the same executor · Even after...
