Very slow queuing with plenty of idle runners available

Originally posted here:

https://github.community/t/very-slow-queuing-behavior-when-idle-runners-are-available/127674

Describe the bug

When no other builds are running (all my runners are idle), GH Actions is very slow to start builds. This is especially noticeable with, for example, a 4^4 matrix (256 checks), even if you have 256 idle self-hosted runners.

The UI shows “X queued checks” climbing at a rate of around 4 per second (i.e., “4 queued checks”, “8 queued checks”, etc.) until it finally reaches 256 queued checks. It takes a full 1 min 40 sec before the first of my runners even receives a message and starts building. It takes 12-13 min for the entire run to be marked as finished, even if each build does no work and completes in 1 sec or less.
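As a rough back-of-the-envelope check, the figures reported above can be put side by side. This is only a sketch: the queue rate is approximate ("around 4 per second"), and the assumption that the first runner starts only after every check has been queued is an inference, not something the service documents.

```python
# All figures come from the report above; nothing here is measured independently.
TOTAL_CHECKS = 4 ** 4          # 4^4 matrix -> 256 checks
QUEUE_RATE_PER_SEC = 4         # "around 4 per second", as shown in the UI
FIRST_START_SEC = 1 * 60 + 40  # 1 min 40 sec before the first runner starts

# Time for the UI counter to reach 256 at the reported rate.
queue_fill_sec = TOTAL_CHECKS / QUEUE_RATE_PER_SEC

# Delay left over once queuing is accounted for (assumes queuing finishes first).
remaining_sec = FIRST_START_SEC - queue_fill_sec

print(f"checks: {TOTAL_CHECKS}")
print(f"time to queue all checks: {queue_fill_sec:.0f}s")
print(f"remaining delay before first start: {remaining_sec:.0f}s")
```

Under these assumptions, queuing alone would account for roughly two thirds of the delay before the first build starts.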

To Reproduce

  1. Register (and run) 256 self-hosted runners
  2. Run a workflow that uses a 4^4 matrix:
    strategy:
      matrix:
        ix1: [ 0, 1, 2, 3 ]
        ix2: [ 0, 1, 2, 3 ]
        ix3: [ 0, 1, 2, 3 ]
        ix4: [ 0, 1, 2, 3 ]
  3. Observe that it takes a long time before the first build message is sent to a runner
  4. Observe how, even after all checks are completed, it still takes many minutes (10?) for the entire workflow to be marked as finished
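To make the size of the reproduction concrete, the matrix above can be expanded as a Cartesian product. This is a small illustrative model, not GitHub's implementation; `expand` is a hypothetical helper introduced here.

```python
from itertools import product

# The matrix from the workflow snippet above.
matrix = {
    "ix1": [0, 1, 2, 3],
    "ix2": [0, 1, 2, 3],
    "ix3": [0, 1, 2, 3],
    "ix4": [0, 1, 2, 3],
}

def expand(matrix):
    """Return one dict per combination of matrix values (Cartesian product)."""
    keys = list(matrix)
    return [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]

jobs = expand(matrix)
print(len(jobs))   # number of checks the workflow creates
print(jobs[0])     # first combination
print(jobs[-1])    # last combination
```

Each of the four axes has four values, so the workflow produces 4 × 4 × 4 × 4 = 256 checks, one per idle runner in the reproduction.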

You can also observe similar behavior with fewer checks, e.g., even just 16 runners and a 4^2 matrix. Even then, checks sit in the queue before the first build starts, and there’s a noticeable delay after the 16th check has finished before the whole workflow is marked as complete. I see an overall run time of 1 min 10 sec, even though each worker has completed its build in less than a second.

Expected behavior

  1. There should be no queueing when idle runners are available
  2. Runners should receive build messages with no delay – within a few seconds at most.
  3. Workflows should be marked as finished soon after the final check is complete.

Runner Version and Platform

Checked 2.272.0 and 2.273.0

Checked on OSX and Linux

What’s not working?

This is not a problem with the runner (i.e., the software in this repo) in any way as far as I can tell. It’s purely behavior on the service side that’s causing this.

I suspect there’s some sort of serial process whose cost compounds the more parallel jobs you have (which defeats the purpose of having parallel jobs in the first place).
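One way to sanity-check that suspicion is a deliberately simple model: assume the service handles jobs strictly one at a time with a fixed overhead each, and that only the last job's run time adds to the total. Both the model and the `implied_overhead_per_job` helper are assumptions made for illustration. The inputs are the timings reported in this issue, with 12.5 min taken as the midpoint of "12-13min".

```python
def implied_overhead_per_job(total_sec: float, jobs: int, job_sec: float = 1.0) -> float:
    """Per-job service overhead implied by a strictly serial model.

    Model (an assumption): total = jobs * overhead + job_sec
    """
    return (total_sec - job_sec) / jobs

# (overall run time in seconds, number of checks), as reported in this issue.
runs = {
    "4^2 matrix (16 checks)": (70.0, 16),          # 1 min 10 sec overall
    "4^4 matrix (256 checks)": (12.5 * 60, 256),   # midpoint of 12-13 min
}

for label, (total_sec, jobs) in runs.items():
    overhead = implied_overhead_per_job(total_sec, jobs)
    print(f"{label}: ~{overhead:.1f}s of overhead per job")
```

Both runs imply a per-job overhead of a few seconds. That is compatible with a cost that grows roughly linearly with the number of jobs, though two data points cannot confirm it.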

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 12 (1 by maintainers)

Top GitHub Comments

direbunny commented, Apr 22, 2022 (9 reactions)

@hross This is still a problem over a year later - is work still being done here?

gierdo commented, Aug 25, 2020 (2 reactions)

Our self-hosted runner infrastructure scales automatically; downscaling removes runners that have been idle for a certain amount of time. With sane idle time limits (~10 minutes), runners get terminated before jobs from our (ever-growing) queue are assigned to them.

This issue costs us quite a bit of time and money.
