Error when auto-upgrading runner service on MacOS - service fails to restart after upgrade

Describe the bug

When attempting to auto-upgrade the runner service on macOS, the runner service repeatedly fails to relaunch, and the machine has to be rebooted before it launches successfully.

To Reproduce

Steps to reproduce the behavior:

  1. Install the runner (but don’t start it)
  2. Install the service (but don’t start it) via ./svc.sh install
  3. Ensure the service runs automatically at boot without needing to log in (as per https://github.com/actions/runner/issues/349):
     sudo cp /Users/dbadmin/Library/LaunchAgents/actions.runner.diffblue-cover.ci-mac-04.plist /Library/LaunchDaemons/actions.runner.diffblue-cover.ci-mac-04.plist
  4. Reboot the machine and check (via the Actions page) that GitHub can see the runner is online
  5. Wait for an upgrade. The service will then go offline and not restart unless you reboot the machine manually.

Expected behavior

The service should upgrade and restart correctly, as it did when upgrading from 2.273.3 to 2.273.4:

2020-09-23 14:42:59Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.273.4 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
Runner listener exited with error code 3
Runner listener exit because of updating, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
√ Connected to GitHub
2020-09-23 14:46:49Z: Listening for Jobs

Runner Version and Platform

Upgrading 2.273.4 to 2.273.5 on macOS 10.15.5

Runner and Worker’s Diagnostic Logs

/Users/dbadmin/Library/Logs/actions.runner.diffblue-cover.ci-mac-04/stderr.log is empty.

Output of /Users/dbadmin/Library/Logs/actions.runner.diffblue-cover.ci-mac-04/stdout.log:

Runner update in progress, do not shutdown runner.
Downloading 2.273.5 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
Runner listener exited with error code 3
Runner listener exit because of updating, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
Runner listener exited with error code null
Runner listener exit with undefined return code, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
Runner listener exited with error code null
Runner listener exit with undefined return code, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
Runner listener exited with error code null
Runner listener exit with undefined return code, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
Runner listener exited with error code null
Runner listener exit with undefined return code, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
Runner listener exited with error code null
Runner listener exit with undefined return code, re-launch runner in 5 seconds.
Starting Runner listener with startup type: service
Started listener process
Runner listener exited with error code null
Runner listener exit with undefined return code, re-launch runner in 5 seconds.

This continues until I reboot the machine, whereupon stdout.log changes to:

Shutting down runner listener
Sending SIGINT to runner listener to stop
.path=/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
Starting Runner listener with startup type: service
Started listener process
Started running service
2020-10-06 15:30:15Z: Runner connect error: nodename nor servname provided, or not known. Retrying until reconnected.

√ Connected to GitHub

2020-10-06 15:30:47Z: Runner reconnected.

and everything is normal.

This was the output of the relevant SelfUpdate log in _diag/:

cat SelfUpdate-20201005-180911.log.succeed 
[2020-10-05 19:09:11-4N] --------whoami--------
dbadmin
[2020-10-05 19:09:11-4N] --------whoami--------
[2020-10-05 19:09:11-4N] Waiting for Runner.Listener (432) to complete
[2020-10-05 19:09:11-4N] Process 432 finished running
[2020-10-05 19:09:11-4N] Sleep 1 more second to make sure process exited
[2020-10-05 19:09:12-4N] Delete existing junction bin folder
[2020-10-05 19:09:12-4N] Delete existing junction externals folder
[2020-10-05 19:09:12-4N] Create junction bin folder
[2020-10-05 19:09:12-4N] Create junction externals folder
[2020-10-05 19:09:12-4N] Update succeed
[2020-10-05 19:09:12-4N] Rename /Users/dbadmin/actions-runner/_diag/SelfUpdate-20201005-180911.log to be /Users/dbadmin/actions-runner/_diag/SelfUpdate-20201005-180911.log.succeed
/Users/dbadmin/actions-runner/_diag/SelfUpdate-20201005-180911.log -> /Users/dbadmin/actions-runner/_diag/SelfUpdate-20201005-180911.log.succeed

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 6

Top GitHub Comments

thboop commented, Sep 1, 2021 (3 reactions)

Thanks for the information!

Our team has been able to reproduce this issue and is working on a fix, which we hope to roll out with the next runner release so that this doesn’t happen again. I appreciate the detailed writeup @kevindumanoir, it helped us narrow down the issue considerably.

kevindumanoir commented, Oct 7, 2020 (3 reactions)

This is a tricky one. We have the same issue at Fabernovel, and here’s what we discovered:

  • The service watchdog is run from runsvc.sh. It is a Node.js script invoked with a specific version of Node.js, located at <YOUR_ACTIONS_RUNNER_PATH>/externals/node12/bin/node. One important thing to remember: externals is a symbolic link.
  • The service watchdog, RunnerService.js, fails to spawn subprocesses because Apple’s System Policy daemon, syspolicyd, considers <YOUR_ACTIONS_RUNNER_PATH>/externals.X.A/node12/bin/node to be malware. 🤯
  • node is considered malware because the binary loaded in RAM can no longer be found on disk.
  • node cannot be found because the binary no longer exists. We invoked node through the externals folder, but the update re-pointed the symbolic link to the new version of the runner, i.e. externals.X.X. When the service started the watchdog, however, it used the node executable from externals.X.A, and that folder no longer exists.
  • The folder no longer exists because, after every update, the runner keeps binaries only from version n - 1. When updating to version n+1, the folders externals-(n-1) and bin-(n-1) are deleted.

Conclusion: this issue occurs when, after the service is launched on runner version N, the runner auto-updates a second time.
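
To make the sequence above concrete, here is a minimal sketch that reproduces the symlink behavior in a throwaway directory. It does not touch a real runner install: the folder names externals.A, externals.B and externals.C and the helpers resolve_dir and repoint are hypothetical stand-ins, and the pruning step is a simplified imitation of what the update is described as doing.

```shell
#!/bin/sh
# Stand-in layout for illustration only; not the runner's real update script.

# Print the physical directory that a path (possibly a symlink) resolves to.
resolve_dir() {
    (cd "$1" && pwd -P)
}

# Re-point the "externals" link at another version folder.
repoint() {
    rm -f "$demo_root/externals"
    ln -s "$demo_root/$1" "$demo_root/externals"
}

demo_root=$(mktemp -d)
mkdir "$demo_root/externals.A" "$demo_root/externals.B" "$demo_root/externals.C"
repoint externals.A

# The watchdog starts now: its node binary physically lives in externals.A.
started_from=$(resolve_dir "$demo_root/externals")

# First update: the link moves to B, and A is kept as version n - 1.
repoint externals.B

# Second update: the link moves to C, and A is pruned.
repoint externals.C
rmdir "$demo_root/externals.A"

now_points_to=$(resolve_dir "$demo_root/externals")

if [ ! -d "$started_from" ]; then
    echo "launch-time folder is gone: $(basename "$started_from")"
fi
echo "link now resolves to: $(basename "$now_points_to")"

rm -rf "$demo_root"
```

After the first update the launch-time folder still exists, which would explain why a single upgrade can succeed; only the second update removes it while the long-running process is still tied to it.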

So, how can we deal with it? Obviously, Apple cannot rely on a symbolic path to check a program for malware behavior. Here are some ideas (not exhaustive):

  • Force the node watchdog to exit when an update succeeds. It could exit with a specific status code, which runsvc.sh would interpret as a request to restart the watchdog.
  • Keep the externals.X.X folder for the currently running node process (not really a good idea, imo).
  • Explain how to disable syspolicyd (meh, not good either).
  • Rely on launchctl to keep the Runner.Listener process alive (i.e. remove the node watchdog), possibly by relying on SuccessfulExit.
  • Let the runner exit with code 0 or 1 when an update finishes successfully, so we can rely on the logic already implemented in the watchdog. Or let error code 3 stop the watchdog.
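
As a rough illustration of the first idea, the wrapper could relaunch the watchdog whenever it exits with an agreed status code, so that each relaunch goes through the externals link again. This is only a sketch under assumptions: WATCHDOG_RESTART_CODE, next_action and supervise are hypothetical names, the value 3 is borrowed from the listener's update exit code in the logs purely for illustration, and none of this is taken from the actual runsvc.sh.

```shell
#!/bin/sh
# Hypothetical wrapper logic; not the contents of the real runsvc.sh.

# Assumed exit code meaning "update finished, please restart me".
WATCHDOG_RESTART_CODE=3

# Decide what the wrapper should do for a given watchdog exit code.
next_action() {
    case "$1" in
        "$WATCHDOG_RESTART_CODE") echo "relaunch" ;;
        0)                        echo "stop" ;;
        *)                        echo "fail" ;;
    esac
}

# Run the given command again and again while it asks to be relaunched,
# then print the final action ("stop" or "fail").
supervise() {
    while :; do
        code=0
        "$@" || code=$?
        action=$(next_action "$code")
        [ "$action" = "relaunch" ] || break
    done
    echo "$action"
}
```

In a real wrapper the command passed to supervise would be the node invocation through the externals link, for example supervise ./externals/node12/bin/node ./bin/RunnerService.js, where the script path is an assumption. Because the path is resolved on every relaunch, the new process would start from whichever folder the link points to at that moment.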