Self-hosted runners for GitHub Actions fail very often on
Very often, the self-hosted runners fail with this message:
The self-hosted runner: Airflow Runner 32 lost communication with the server.
Verify the machine is running and has a healthy network connection.
Anything in your workflow that terminates the runner process, starves it for CPU/Memory,
or blocks its network access can cause this error.
Example failure: https://github.com/apache/airflow/actions/runs/584691417
It has happened on basically every run (in many cases more than once per run) over the last few pushes I've done.
I think we need to get to the root cause of it. I suspect it has something to do with scaling the runners in and out.
Happy to help solve it; I just need access to the logs, @ashb 😃.
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
It works much better now! Thanks! Closing it.
I have been working on this slowly. My hypothesis is that it's a race condition: while the runner is busy it is protected from scale-in; when the job finishes it gets un-protected; AWS starts terminating the instance; but before the instance actually terminates, the runner picks up a new job, right in time to get hard-killed.
My in-progress fix is to use a lifecycle hook so that the instance does not get killed instantly.
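For anyone following along, here is a rough sketch of what a lifecycle-hook based drain could look like. It is not the actual fix used for the Airflow runners, just an illustration of the idea under some assumptions: the instance sits in the `Terminating:Wait` state thanks to a termination lifecycle hook, `asg_client` is something like `boto3.client("autoscaling")`, and `is_runner_busy` / `stop_accepting_jobs` are hypothetical callbacks you would have to implement for your own runner setup. The two client methods used (`record_lifecycle_action_heartbeat` and `complete_lifecycle_action`) are real Auto Scaling API calls; everything else is made up for the example.

```python
import time
from typing import Callable


def drain_before_termination(
    asg_client,
    is_runner_busy: Callable[[], bool],
    stop_accepting_jobs: Callable[[], None],
    *,
    group_name: str,
    hook_name: str,
    instance_id: str,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Hold off termination until the runner is idle.

    Returns the number of heartbeats sent while waiting.
    All parameter names here are illustrative, not from the original issue.
    """
    # Close the race window first: the runner must not pick up a new job
    # once AWS has decided to terminate this instance.
    stop_accepting_jobs()

    heartbeats = 0
    while is_runner_busy():
        # Extend the lifecycle hook timeout so the instance is not
        # hard-killed while a job is still running.
        asg_client.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=group_name,
            InstanceId=instance_id,
        )
        heartbeats += 1
        sleep(poll_seconds)

    # The runner is idle now, so let the termination proceed.
    asg_client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
    return heartbeats
```

The client and the `sleep` function are passed in so the logic can be exercised with a fake client, without touching AWS. Note that a lifecycle hook can only be extended up to a maximum total wait, so very long jobs would still need some other protection.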