Very slow queuing with plenty of idle runners available
Originally posted here:
https://github.community/t/very-slow-queuing-behavior-when-idle-runners-are-available/127674
Describe the bug
When no other builds are running (all of my runners are idle), GitHub Actions exhibits a long delay before builds even start. This is especially noticeable with, for example, a 4^4 matrix (256 checks), even with 256 idle self-hosted runners.
The UI increments the "X queued checks" counter at a rate of around 4 per second (i.e., "4 queued checks", "8 queued checks", and so on) before it finally reaches 256 queued checks. It takes a full 1 min 40 sec before the first of my runners even receives a message and starts building, and 12-13 min for the entire run to be marked as finished, even when each build does no work and completes in 1 sec or less.
To Reproduce
- Register (and run) 256 self-hosted runners
- Run a workflow that uses a 4^4 matrix (a complete workflow sketch follows at the end of this section):
  strategy:
    matrix:
      ix1: [ 0, 1, 2, 3 ]
      ix2: [ 0, 1, 2, 3 ]
      ix3: [ 0, 1, 2, 3 ]
      ix4: [ 0, 1, 2, 3 ]
- Observe that it takes a long time before the first build message is sent to a runner
- Observe how, even after all checks are completed, it still takes many minutes (10?) for the entire workflow to be marked as finished
You can also observe similar behavior with fewer checks, e.g., with just 16 runners and a 4^2 matrix. Even then, checks are queued before the first build starts, and there is a noticeable delay after the 16th check has finished before the whole workflow is marked as complete. I see an overall run time of 1 min 10 sec, even though each runner completes its build in less than a second.
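For reference, a complete minimal workflow that exercises this matrix might look like the sketch below. Only the 4^4 matrix comes from the report above; the workflow name, trigger, runner label, and no-op step are illustrative assumptions.

```yaml
# Hypothetical minimal reproduction. Only the 4^4 matrix is taken from the
# report; the name, trigger, "self-hosted" label, and no-op step are
# assumptions made for illustration.
name: queue-delay-repro

on: workflow_dispatch

jobs:
  noop:
    runs-on: self-hosted   # assumes the idle runners carry the default "self-hosted" label
    strategy:
      matrix:
        ix1: [ 0, 1, 2, 3 ]
        ix2: [ 0, 1, 2, 3 ]
        ix3: [ 0, 1, 2, 3 ]
        ix4: [ 0, 1, 2, 3 ]
    steps:
      # Each job finishes almost immediately, so any remaining wall-clock time
      # is queuing/scheduling overhead on the service side.
      - run: echo "done ${{ matrix.ix1 }}.${{ matrix.ix2 }}.${{ matrix.ix3 }}.${{ matrix.ix4 }}"
```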
Expected behavior
- There should be no queueing when idle runners are available
- Runners should receive build messages with no delay – within a few seconds at most.
- Workflows should be marked as finished soon after the final check is complete.
Runner Version and Platform
Checked 2.272.0 and 2.273.0
Checked on macOS and Linux
What’s not working?
As far as I can tell, this is not a problem with the runner (i.e., the software in this repo) in any way; it's purely service-side behavior that's causing this.
I suspect there's some sort of serial process whose cost compounds the more parallel jobs you have (which defeats the purpose of having parallel jobs in the first place).
Issue Analytics
- Created: 3 years ago
- Reactions: 2
- Comments: 12 (1 by maintainers)
Top GitHub Comments
@hross This is still a problem over a year later - is work still being done here?
Our self-hosted runner infrastructure scales automatically; downscaling removes runners that have been idle for a certain amount of time. With sane idle-time limits (~10 minutes), runners get terminated before jobs from our (ever-growing) queue are assigned to them.
This issue costs us quite a bit of time and money.