Automatic upgrades fail to re-connect with GitHub on Mac

Issue

We have multiple organization-level self-hosted Mac runners that all experience the same issue every few weeks. After an auto-upgrade, the new runner restarts successfully but appears as offline on the https://github.com/organizations/NAME/settings/actions page, and no workflows can be processed until the runner is restarted via ./svc.sh stop && ./svc.sh start.

This happened today with the upgrade from v2.276.0 to v2.276.1, and previously on 12/18. We are unsure which version the 12/18 upgrade was from (the logs are gone).

To Reproduce

Steps to reproduce the behavior:

  1. Download an older runner
  2. Wait for the auto upgrade
  3. Note that the upgrade succeeds and the runner restarts (as evidenced by the ./svc.sh status output), but the organization Actions settings page shows the runner as offline and the runner does not accept any new jobs
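
The offline state in step 3 can also be checked from a script instead of the settings page. The following is only a sketch and not part of the original report: it assumes GitHub's REST endpoint for listing an organization's self-hosted runners (GET /orgs/{org}/actions/runners), whose response contains a runners array with name and status fields, and a token with permission to read organization runners. The function names offline_runners and fetch_runners are made up for illustration.

```python
import json
import urllib.request


def offline_runners(payload):
    """Return the names of runners whose status is not "online"."""
    return [
        runner["name"]
        for runner in payload.get("runners", [])
        if runner.get("status") != "online"
    ]


def fetch_runners(org, token):
    """Fetch one page of an organization's self-hosted runners.

    Assumes the endpoint GET /orgs/{org}/actions/runners; pagination is
    ignored in this sketch.
    """
    request = urllib.request.Request(
        f"https://api.github.com/orgs/{org}/actions/runners",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Made-up payload standing in for a real API response.
    sample = {
        "total_count": 2,
        "runners": [
            {"name": "mac-1", "status": "online"},
            {"name": "mac-2", "status": "offline"},
        ],
    }
    print(offline_runners(sample))  # prints ['mac-2']
```

A runner reported here as offline while ./svc.sh status says the service is running would match the symptom described above; the workaround from the report is then ./svc.sh stop && ./svc.sh start on that host.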

Expected behavior

Once the upgrade completes and the runner restarts, it should reconnect to GitHub to prevent any downtime.

Runner Version and Platform

  • v2.276.1 (upgraded from 2.276.0) (all hosts are macOS Catalina 10.15.6, and all hosts experience this)

What’s not working?

Self-hosted runner does not reconnect to GitHub after upgrade.

Runner and Worker’s Diagnostic Logs

Runner (before and after) and SelfUpdate logs are attached: Update Logs.zip

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 14 (10 by maintainers)

Top GitHub Comments

2 reactions
kevindumanoir commented, Feb 2, 2021

@TingluoHuang It is the same error as described in https://github.com/actions/runner/issues/743. There’s a related PR waiting for review or improvements: https://github.com/actions/runner/pull/746

This is a tricky one. We have the same issue at Fabernovel, and here’s what we discovered:

  • The service watchdog is run from runsvc.sh. It is a Node.js script invoked with a specific version of Node.js, located at <YOUR_ACTIONS_RUNNER_PATH>/externals/node12/bin/node. One important thing to remember: externals is a symbolic link.
  • The service watchdog, RunnerService.js, fails to spawn subprocesses because the Apple system policy daemon, syspolicyd, considers <YOUR_ACTIONS_RUNNER_PATH>/externals.X.A/node12/bin/node to be malware. 🤯
  • node is considered malware because the binary file, which is loaded in RAM, cannot be found on the hard drive.
  • node cannot be found because the binary no longer exists. We invoked node through the externals folder, but after the update the symbolic link was changed: it now points to the new version of the actions runner, i.e. externals.X.X. However, when the service started the watchdog, it used the node executable from externals.X.A, and it seems this folder no longer exists.
  • This folder no longer exists because after every update, the actions runner only keeps binaries from version n - 1. When updating to version n+1, the folders externals-(n-1) and bin-(n-1) are deleted.
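
The chain of events in the list above can be reproduced on a scratch directory. The sketch below is an illustration under assumptions, not the runner's actual update code: the version labels A, B and C, the helper names, and the "keep only the current and previous version" pruning rule are modelled on the description in this comment. It records the real path a long-lived watchdog would have been started from, then checks whether that path still exists after each update.

```python
import os
import shutil
import tempfile


def install(root, version):
    """Create externals.<version> with a fake node file and repoint the symlink."""
    bin_dir = os.path.join(root, f"externals.{version}", "node12", "bin")
    os.makedirs(bin_dir)
    with open(os.path.join(bin_dir, "node"), "w") as handle:
        handle.write("fake node binary\n")
    link = os.path.join(root, "externals")
    staging = link + "-staging"
    os.symlink(f"externals.{version}", staging)  # relative symlink target
    os.replace(staging, link)  # swap the externals symlink itself


def prune(root, keep):
    """Delete every externals.<version> folder that is not listed in keep."""
    for name in os.listdir(root):
        if name.startswith("externals.") and name not in keep:
            shutil.rmtree(os.path.join(root, name))


def simulate(root, versions):
    """Start a 'watchdog' on versions[0], then apply each later update in turn.

    Returns the resolved node path used by the watchdog and, for each update,
    whether that path still exists afterwards.
    """
    install(root, versions[0])
    # The watchdog resolves the symlink once, at service start.
    watchdog_node = os.path.realpath(
        os.path.join(root, "externals", "node12", "bin", "node")
    )
    still_there = []
    for previous, new in zip(versions, versions[1:]):
        install(root, new)
        prune(root, keep={f"externals.{previous}", f"externals.{new}"})
        still_there.append(os.path.exists(watchdog_node))
    return watchdog_node, still_there


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        node_path, still_there = simulate(root, ["A", "B", "C"])
        print(still_there)  # prints [True, False]
```

The first update leaves the watchdog's binary in place; the second update removes it, which is the situation in which, per this comment, syspolicyd can no longer find the file on disk.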

Conclusion: this issue occurs when the runner auto-updates a second time after the service was launched under actions runner version N.

So, how can we deal with it? Obviously, Apple cannot rely on a symbolic path when checking a program for malware behavior. Here are some ideas (non-exhaustive):

  • Force the node watchdog to exit when an update succeeds. It might be possible to have it exit with a specific status code, which runsvc.sh would interpret as a watchdog restart request.
  • Keep the externals.X.X folder for the currently used node process (not really a good idea, imo).
  • Explain how to disable syspolicyd (meh, not good either).
  • Rely on launchctl to keep the Runner.Listener process alive (i.e. remove the node watchdog), possibly relying on SuccessfulExit.
  • Have the Runner exit with code 0 or 1 when an update finishes successfully, so we can rely on the logic already implemented in the watchdog. Or let error code 3 stop the watchdog.
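
As a rough illustration of the first idea in the list, the sketch below shows a supervising loop that relaunches a watchdog whenever it exits with a dedicated "restart me" status, resolving the executable path again before every launch so that the new process is started from a folder that currently exists. This is a hypothetical design, not code from runsvc.sh or RunnerService.js: the exit code value 75, the function names, and the injected resolve_executable and run callables are all assumptions made for the example.

```python
RESTART_REQUESTED = 75  # hypothetical exit code; not defined by the runner


def supervise(resolve_executable, run, max_launches=5):
    """Launch a watchdog, relaunching it while it asks for a restart.

    resolve_executable() returns the path to launch and is called before
    every launch, so a retargeted symlink is picked up again each time.
    run(path) starts the watchdog and returns its exit code.
    Returns the final exit code and the list of paths that were launched.
    """
    launched = []
    for _ in range(max_launches):
        executable = resolve_executable()
        launched.append(executable)
        exit_code = run(executable)
        if exit_code != RESTART_REQUESTED:
            return exit_code, launched
    return RESTART_REQUESTED, launched


if __name__ == "__main__":
    # Fake stand-ins: the symlink target changes after an update, and the
    # first watchdog run ends by requesting a restart.
    targets = iter(["externals.A/node12/bin/node", "externals.B/node12/bin/node"])
    exit_codes = iter([RESTART_REQUESTED, 0])
    code, launched = supervise(lambda: next(targets), lambda path: next(exit_codes))
    print(code, launched)
```

In a real service the run callable would wrap something like subprocess.run, and resolve_executable would resolve the externals symlink; both are left as injected functions here so the control flow can be read and tested on its own.
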

1 reaction
TingluoHuang commented, Feb 1, 2021

Added PR to trace process error.
