
Self-hosted runners for GitHub Actions fail very often on


Very often, the self-hosted runners fail with this message:

The self-hosted runner: Airflow Runner 32 lost communication with the server.
Verify the machine is running and has a healthy network connection.
Anything in your workflow that terminates the runner process, starves it for CPU/Memory,
or blocks its network access can cause this error.

Example failure: https://github.com/apache/airflow/actions/runs/584691417

It has happened on basically every one of my last few pushes, in many cases more than once.

I think we need to get to the root cause. I suspect it has something to do with scaling the runners in and out.

Happy to help solve it; I just need access to the logs, @ashb 😃.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

potiuk commented, Feb 21, 2021

It works much better now! Thanks! Closing it.

ashb commented, Feb 20, 2021

I have been working on this slowly. My hypothesis is that it’s a race condition: while the runner is busy, it is protected from scale-in; when it finishes, the protection is removed and AWS starts terminating the instance, but before the instance actually terminates, the runner picks up a new job, just in time to get hard-killed.

My in-progress fix is to use a lifecycle hook so that the instance does not get killed instantly.
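To make the suspected race concrete, here is a toy Python model of the sequence described in the comment: a job finishes, scale-in protection is removed, scale-in selects the instance, a new job arrives, and the instance shuts down. Everything in it is hypothetical: the `Runner` class, its methods, and in particular the assumption that a termination lifecycle hook holds the instance long enough for the runner to stop accepting new jobs. It makes no AWS or GitHub API calls and is only a sketch of the idea, not the actual fix.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Runner:
    """Toy model of one autoscaled self-hosted runner (hypothetical, no AWS calls)."""

    use_lifecycle_hook: bool
    current_job: Optional[str] = None
    protected: bool = False        # scale-in protection, held while busy
    accepting_jobs: bool = True    # whether the runner still picks up new jobs
    terminating: bool = False      # scale-in has selected this instance
    terminated: bool = False
    killed_jobs: List[str] = field(default_factory=list)
    rejected_jobs: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A runner that starts out busy is protected from scale-in.
        self.protected = self.current_job is not None

    def finish_job(self) -> None:
        # Job done: the runner is idle and loses scale-in protection.
        self.current_job = None
        self.protected = False

    def scale_in(self) -> None:
        # The autoscaler can only select instances that are not protected.
        if self.protected:
            return
        self.terminating = True
        if self.use_lifecycle_hook:
            # Assumption: the hook holds the instance and the runner is told
            # to stop accepting jobs before the instance is shut down.
            self.accepting_jobs = False

    def offer_job(self, job: str) -> None:
        if self.terminated or not self.accepting_jobs:
            self.rejected_jobs.append(job)  # left for another runner
            return
        self.current_job = job
        self.protected = True  # too late if termination has already started

    def terminate(self) -> None:
        # The instance actually shuts down.
        if not self.terminating or self.terminated:
            return
        if self.use_lifecycle_hook and self.current_job is not None:
            return  # hook still waiting for the job to finish
        if self.current_job is not None:
            self.killed_jobs.append(self.current_job)  # hard kill mid-job
            self.current_job = None
        self.terminated = True


def simulate(use_lifecycle_hook: bool) -> Runner:
    r = Runner(use_lifecycle_hook=use_lifecycle_hook, current_job="job-A")
    r.finish_job()        # job A completes; protection is removed
    r.scale_in()          # scale-in picks the now-idle instance
    r.offer_job("job-B")  # a new job arrives before the instance is gone
    r.terminate()         # the instance shuts down
    return r


if __name__ == "__main__":
    for hook in (False, True):
        r = simulate(hook)
        print(f"lifecycle hook={hook}: killed={r.killed_jobs} rejected={r.rejected_jobs}")
```

Without the hook, `job-B` is accepted in the gap and then killed mid-run; with the hook, it is turned away while the instance is held, so another runner can pick it up. In a real setup the hold would come from an EC2 Auto Scaling termination lifecycle hook, with something on the instance stopping the runner before completing the lifecycle action; that wiring is not modelled here.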


Top Results From Across the Web

Monitoring and troubleshooting self-hosted runners
You can monitor your self-hosted runners to view their activity and diagnose ... If you have any failing checks, you can see more...
[Self-hosted] job abandoned #1546 - actions/runner - GitHub
Describe the bug Since yesterday, CI jobs keep failing. I tried to re-run the previously passed changes and still failed.
Dealing with jobs failing with "lost communication with the ...
I think I have not yet encountered this myself, but I believe any jobs on self-hosted GitHub runners are subject to get this...
Checkout action randomly fails on self-hosted runner #333
This issue occurs randomly. Sometimes re-running the action fixes this. Any steps to debug the issue and find the root cause? The error...
Workflow failure due to runner shutdown/stoppage · Issue #2040
Since 30 July 2022, our workflow fails with the following message: "The self-hosted runner: ***** lost communication with the server. Verify the ...
