Heartbeat does not detect zombie processes when using Local Agent

Description

I have been testing situations where a task fails due to external causes (for example, I killed the task process with kill -9). I found that with a Local Agent the zombie process is never detected, whether I use Prefect Cloud or run locally on my laptop. However, if I stop the Local Agent and restart it, the zombie process is detected and everything works correctly; with Prefect Cloud the flow run is even rescheduled, thanks to the Lazarus process.

For comparison, with the Docker Agent, killing the running flow with docker kill <container_id> works correctly: after a few minutes the flow is retried, and there is no need to restart the agent.

Expected Behavior

I expect the zombie detection and rescheduling that happen when the Local Agent is restarted to work without needing a restart.

Reproduction

Here is the flow definition I used to test this:

import datetime
import time
import os
import prefect
from prefect import task, Flow


def append_result(result):
    with open("/tmp/file.txt", "a") as f:
        f.write(result)
        f.write("\n")

@task
def delete_file():
    try:
        os.remove('/tmp/file.txt')
    except FileNotFoundError:
        # Nothing to delete if the file does not exist yet
        pass

@task(max_retries=5, retry_delay=datetime.timedelta(seconds=2), timeout=60)
def generate_file_simple():
    for i in range(10):
        time.sleep(1)
        append_result(f"{datetime.datetime.now()}: {i}. I am PID: {os.getpid()}")
        
        
with Flow("be-killed") as f:
    t1 = delete_file()
    t2 = generate_file_simple()

    # set dependency
    t2.set_upstream(t1)

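# register flow in prefect cloud
with open('../prefect-cloud-user-token') as token_file:
    user_api_token = token_file.read().strip()

client = prefect.Client(api_token=user_api_token)
client.login_to_tenant(tenant_slug='XXXXX')
# NOTE: flow_id is never defined in this snippet; it should hold the ID of the
# registered flow (the registration call is not shown in the original report).
flow_run_id = client.create_flow_run(flow_id=flow_id)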

Once I see that the agent is running the task and verify that the file is being written, I kill the process with the following, where XXXX is the PID written on each line of the file:

import os
os.kill(XXXX,  9)
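
As a convenience, the kill step can be scripted instead of copying the PID by hand. The following is only a sketch: the helper names last_reported_pid and kill_task_process are hypothetical, and it assumes the line format produced by generate_file_simple above ("... I am PID: <pid>") and a POSIX system where SIGKILL is available.

```python
import os
import re
import signal

# Matches the "I am PID: <number>" suffix written by generate_file_simple
PID_PATTERN = re.compile(r"I am PID: (\d+)")


def last_reported_pid(path="/tmp/file.txt"):
    """Return the PID found on the most recent matching line, or None."""
    with open(path) as fh:
        lines = fh.read().splitlines()
    # Walk backwards so the newest line (the current attempt) wins
    for line in reversed(lines):
        match = PID_PATTERN.search(line)
        if match:
            return int(match.group(1))
    return None


def kill_task_process(path="/tmp/file.txt"):
    """Send SIGKILL (signal 9) to the process that last wrote to the file."""
    pid = last_reported_pid(path)
    if pid is None:
        raise RuntimeError(f"no PID found in {path}")
    os.kill(pid, signal.SIGKILL)  # same effect as os.kill(pid, 9)
    return pid
```

Calling kill_task_process() while the task is still writing to the file reproduces the manual step above.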

Environment

{
  "config_overrides": {},
  "env_vars": [],
  "system_information": {
    "platform": "Darwin-19.4.0-x86_64-i386-64bit",
    "prefect_version": "0.11.2",
    "python_version": "3.7.7"
  }
}

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
jcozar87 commented, Jun 22, 2020

Oh I see! Thank you very much! I guess that since the Local Agent submits flow runs to run in a subprocess, the heartbeat should check the subprocess status as well.

However, I find Docker (and the Docker Agent) more reliable in production. So, since the issue is entirely explainable, I will keep using Prefect with confidence 😃
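
To illustrate the idea of a heartbeat that also checks the subprocess status, here is a minimal sketch. It is not Prefect's implementation: the names describe_exit and watch are hypothetical, and it relies only on the standard library behaviour that, on POSIX, Popen.poll() returns a negative return code -N when the child was terminated by signal N.

```python
import signal
import subprocess
import time


def describe_exit(returncode):
    """Translate a Popen return code into a short status string."""
    if returncode is None:
        return "running"
    if returncode < 0:
        # On POSIX, a negative value -N means "terminated by signal N"
        return f"killed by {signal.Signals(-returncode).name}"
    if returncode == 0:
        return "finished"
    return f"failed with exit code {returncode}"


def watch(proc, interval=0.1, on_heartbeat=None):
    """Poll a child process until it exits and report how it ended.

    on_heartbeat is an optional callback invoked on every tick while
    the child is still alive (a stand-in for sending a heartbeat).
    """
    while True:
        status = proc.poll()
        if status is not None:
            return describe_exit(status)
        if on_heartbeat is not None:
            on_heartbeat()
        time.sleep(interval)
```

With a check like this, a parent that launched the flow run via subprocess.Popen could notice a kill -9 on the child and report the run as failed instead of leaving it in a running state.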

1 reaction
madkinsz commented, Oct 25, 2022

@kevin868 There is a different issue #7239 for v2. There are no heartbeats in v2 at this time, but we would like to figure out a way to get this working.