Self-hosted runners for GitHub Actions fail very often on

Very often, the self-hosted runners fail with this message:

The self-hosted runner: Airflow Runner 32 lost communication with the server.
Verify the machine is running and has a healthy network connection.
Anything in your workflow that terminates the runner process, starves it for CPU/Memory,
or blocks its network access can cause this error.

Example failure: https://github.com/apache/airflow/actions/runs/584691417

It has happened on basically every one of my last few pushes, and in many cases more than once per push.

I think we need to get to the root cause of it - I suspect this might have something to do with scaling the runners in and out.

Happy to help solve it - I just need access to the logs, @ashb 😃.

Issue Analytics

Created 3 years ago
Comments: 5 (5 by maintainers)

Top GitHub Comments

potiuk commented, Feb 21, 2021

It works much better now! Thanks! Closing it.

ashb commented, Feb 20, 2021

I have been working on this slowly. My hypothesis is that it’s a race condition: while the runner is busy it is protected from scale-in; when it finishes, that protection is removed and AWS starts terminating the instance, but before the instance actually terminates the runner picks up a new job, just in time to get hard-killed.
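
To make the suspected ordering concrete, here is a toy model of the race. It is only a sketch: the `Runner` class, its methods, and the job names are hypothetical and do not correspond to the real runner or to any AWS API. It just replays the sequence of events described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Runner:
    """Hypothetical stand-in for one self-hosted runner instance."""
    protected: bool = False      # scale-in protection flag
    terminating: bool = False    # True once the autoscaler has chosen this instance
    job: Optional[str] = None
    events: List[str] = field(default_factory=list)

    def start_job(self, name: str) -> None:
        # The runner does not know the instance is already terminating,
        # so it accepts the job. Setting protection now is too late.
        self.job = name
        self.protected = True
        self.events.append(f"picked up {name}")

    def finish_job(self) -> None:
        self.events.append(f"finished {self.job}")
        self.job = None
        self.protected = False

    def request_scale_in(self) -> bool:
        if self.protected:
            self.events.append("scale-in refused: instance is protected")
            return False
        self.terminating = True
        self.events.append("scale-in accepted: termination started")
        return True

    def terminate(self) -> Optional[str]:
        """Hard kill; returns the job that was lost, if any."""
        lost = self.job
        self.events.append(f"hard kill, lost job: {lost}")
        self.job = None
        return lost


def simulate_race() -> Tuple[Runner, Optional[str]]:
    runner = Runner()
    runner.start_job("job-1")
    runner.request_scale_in()   # refused while busy
    runner.finish_job()         # protection removed
    runner.request_scale_in()   # accepted, termination begins
    runner.start_job("job-2")   # race window: new job on a doomed instance
    lost = runner.terminate()
    return runner, lost
```

Calling `simulate_race()` returns the runner together with the job that was lost, and `runner.events` shows the order in which things happened.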

My in-progress fix is to use a lifecycle hook so that the instance does not get killed instantly.
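
For illustration only (this is not the actual fix), a termination lifecycle hook could be handled roughly as sketched below. The sketch assumes the `autoscaling` argument behaves like a boto3 Auto Scaling client, whose `complete_lifecycle_action` call is the only AWS API used. The function name `drain_then_release` and the `stop_accepting_jobs` and `is_busy` callbacks are hypothetical placeholders for whatever the runner really exposes.

```python
import time
from typing import Callable


def drain_then_release(
    autoscaling,
    *,
    group_name: str,
    hook_name: str,
    instance_id: str,
    stop_accepting_jobs: Callable[[], None],
    is_busy: Callable[[], bool],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Drain the runner, then let the paused termination proceed.

    Returns True if the runner went idle before the timeout.
    """
    # Close the race window first: no new jobs once termination is pending.
    stop_accepting_jobs()

    drained = True
    waited = 0.0
    while is_busy():
        if waited >= timeout_seconds:
            drained = False  # give up waiting; the job will be cut off
            break
        sleep(poll_seconds)
        waited += poll_seconds

    # Tell the Auto Scaling group that this instance may now be terminated.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
    return drained
```

The key design choice is the ordering: the runner stops taking work before it waits, so a job cannot sneak in between the termination notice and the kill. The lifecycle hook's own timeout on the AWS side would still need to be at least as long as the longest job, which is an assumption this sketch does not enforce.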