Very slow queuing with plenty of idle runners available
Originally posted here:
https://github.community/t/very-slow-queuing-behavior-when-idle-runners-are-available/127674
Describe the bug
When no other builds are running (all of my runners are idle), GitHub Actions exhibits a long delay before builds even start. This is especially noticeable with, for example, a 4^4 matrix (256 checks), even with 256 idle self-hosted runners.
The UI increments the "X queued checks" counter at a rate of around 4 per second (i.e., "4 queued checks", "8 queued checks", and so on) before it finally reaches 256 queued checks. It takes a full 1 min 40 sec before the first of my runners even receives a message and starts building, and 12-13 min for the entire run to be marked as finished, even when each build does no work and completes in 1 sec or less.
To Reproduce
- Register (and run) 256 self-hosted runners
- Run a workflow that uses a 4^4 matrix (a complete workflow sketch follows at the end of this section):
  strategy:
    matrix:
      ix1: [ 0, 1, 2, 3 ]
      ix2: [ 0, 1, 2, 3 ]
      ix3: [ 0, 1, 2, 3 ]
      ix4: [ 0, 1, 2, 3 ]
- Observe that it takes a long time before the first build message is sent to a runner
- Observe how, even after all checks are completed, it still takes many minutes (10?) for the entire workflow to be marked as finished
You can also observe similar behavior with fewer checks, e.g., with just 16 runners and a 4^2 matrix. Even then, checks are queued before the first build starts, and there is a noticeable delay after the 16th check has finished before the whole workflow is marked as complete. I see an overall run time of 1 min 10 sec, even though each runner completes its build in less than a second.
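For reference, a complete minimal workflow that exercises this matrix might look like the sketch below. Only the 4^4 matrix comes from the report above; the workflow name, trigger, runner label, and no-op step are illustrative assumptions.

```yaml
# Hypothetical minimal reproduction. Only the 4^4 matrix is taken from the
# report; the name, trigger, "self-hosted" label, and no-op step are
# assumptions made for illustration.
name: queue-delay-repro

on: workflow_dispatch

jobs:
  noop:
    runs-on: self-hosted   # assumes the idle runners carry the default "self-hosted" label
    strategy:
      matrix:
        ix1: [ 0, 1, 2, 3 ]
        ix2: [ 0, 1, 2, 3 ]
        ix3: [ 0, 1, 2, 3 ]
        ix4: [ 0, 1, 2, 3 ]
    steps:
      # Each job finishes almost immediately, so any remaining wall-clock time
      # is queuing/scheduling overhead on the service side.
      - run: echo "done ${{ matrix.ix1 }}.${{ matrix.ix2 }}.${{ matrix.ix3 }}.${{ matrix.ix4 }}"
```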
Expected behavior
- There should be no queueing when idle runners are available
- Runners should receive build messages with no delay – within a few seconds at most.
- Workflows should be marked as finished soon after the final check is complete.
Runner Version and Platform
Checked 2.272.0 and 2.273.0
Checked on macOS and Linux
What’s not working?
As far as I can tell, this is not a problem with the runner (i.e., the software in this repo) in any way; it's purely service-side behavior that's causing this.
I suspect there's some sort of serial process whose cost compounds the more parallel jobs you have (which defeats the purpose of having parallel jobs in the first place).
Issue Analytics
- Created: 3 years ago
- Reactions: 2
- Comments: 12 (1 by maintainers)
Top GitHub Comments
@hross This is still a problem over a year later - is work still being done here?
Our self-hosted runner infrastructure scales automatically; downscaling removes runners that have been idle for a certain amount of time. With sane idle-time limits (~10 minutes), runners get terminated before jobs from our (ever-growing) queue are assigned to them.
This issue costs us quite a bit of time and money.