snap auto-refresh breaks cluster

See original GitHub issue

This morning a close-to-production cluster fell over after snap’s auto-refresh “feature” failed on 3 of 4 worker nodes. It looks like it hung at the Copy snap "microk8s" data step. microk8s could be restarted after aborting the auto-refresh, but this only worked after manually killing snapd.

For a production-ready Kubernetes distribution I really think this is far from an acceptable default. Perhaps, until snapd allows disabling auto-refreshes, the microk8s scripts could recommend running sudo snap set system refresh.hold=2050-01-01T15:04:05Z or similar. A Kubernetes-native integration with snapd refreshes could also be considered (e.g. a Prometheus/Grafana dashboard or alert) to prompt manual updates, presumably one node at a time to begin with.
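As a rough sketch of the refresh.hold workaround suggested above: the helper name hold_until is hypothetical, and the snippet assumes GNU date. Some snapd versions are understood to cap how far ahead a hold can be set, so a far-future date such as 2050 may not be honored. Check the behavior of your snapd version before relying on it.

```shell
#!/usr/bin/env bash
# hold_until (hypothetical helper): print a UTC timestamp N days from now
# in the same format as the refresh.hold value quoted in the issue.
hold_until() {
  local days="${1:-30}"
  date --utc --date="+${days} days" +%Y-%m-%dT%H:%M:%SZ
}

# Usage (run manually on each node, as root):
#   sudo snap set system refresh.hold="$(hold_until 30)"
#   snap refresh --time    # should list the refresh schedule and any active hold
```

Because the hold eventually expires, a node would need this re-applied periodically (for example from cron) until a manual, one-node-at-a-time refresh has been done.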

Otherwise microk8s is working rather well so thank you very much.

More details about the outage:

kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
10.aa.aa.aaa   Ready      <none>   38d   v1.17.3
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   38d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
aaa-master     Ready      <none>   59d   v1.17.3

microk8s is disabled…

root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  disabled,classic
root@wk3:/home# snap changes microk8s
ID   Status  Spawn                Ready  Summary
20   Doing   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"
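
A minimal sketch of how a stuck change like the one above could be spotted from a script, for instance to feed an alert. The function name unfinished_changes is an assumption. It only parses the column layout shown in this report and treats "Doing" and "Abort" as unfinished states.

```shell
#!/usr/bin/env bash
# unfinished_changes (hypothetical helper): read `snap changes` output on stdin
# and print the ID of every change whose Status column is still Doing or Abort.
unfinished_changes() {
  awk 'NR > 1 && ($2 == "Doing" || $2 == "Abort") { print $1 }'
}

# Usage:
#   snap changes microk8s | unfinished_changes
```

This says nothing about how long a change has been running. Comparing the Spawn time against the clock would be needed to tell a slow refresh from a hung one.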

Data copy appears hung

root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 09:56 AEDT  Ensure prerequisites for "microk8s" are available
Done    today at 09:56 AEDT  today at 09:56 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 09:56 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Mount snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Run pre-refresh hook of "microk8s" snap if present
Done    today at 09:56 AEDT  today at 09:57 AEDT  Stop snap "microk8s" services
Done    today at 09:56 AEDT  today at 09:57 AEDT  Remove aliases for snap "microk8s"
Done    today at 09:56 AEDT  today at 09:57 AEDT  Make current revision for snap "microk8s" unavailable
Doing   today at 09:56 AEDT  -                    Copy snap "microk8s" data
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" (1254) security profiles
Do      today at 09:56 AEDT  -                    Make snap "microk8s" (1254) available to the system
Do      today at 09:56 AEDT  -                    Automatically connect eligible plugs and slots of snap "microk8s"
Do      today at 09:56 AEDT  -                    Set automatic aliases for snap "microk8s"
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" aliases
Do      today at 09:56 AEDT  -                    Run post-refresh hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Start snap "microk8s" (1254) services
Do      today at 09:56 AEDT  -                    Clean up "microk8s" (1254) install
Do      today at 09:56 AEDT  -                    Run configure hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Run health check of "microk8s" snap
Doing   today at 09:56 AEDT  -                    Consider re-refresh of "microk8s"

There doesn’t seem to be much to copy anyway:

root@wk3 /v/l/snapd# du -sh /var/lib/snapd/ /var/snap/ /snap
527M	/var/lib/snapd/
74G	/var/snap/
2.0G	/snap

root@wk3 /s/microk8s# du -sh /snap/microk8s/*
737M	/snap/microk8s/1176
737M	/snap/microk8s/1254

root@wk3 /s/microk8s# du -sh /var/snap/microk8s/*
232K	/var/snap/microk8s/1176
74G	/var/snap/microk8s/common

Starting microk8s fails

user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress

Fails to abort…

root@wk3:/home# snap abort 20
root@wk3:/home# snap changes
ID   Status  Spawn                Ready  Summary
20   Abort   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"

user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress

root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress

snapd service hangs when trying to stop it…

root@wk2 ~# systemctl stop snapd.service
(hangs)

Have to resort to manually killing the process

killall snapd

Finally the change is undone…

root@wk3:/home# snap changes
ID   Status  Spawn                Ready                Summary
20   Undone  today at 09:56 AEDT  today at 10:41 AEDT  Auto-refresh snap "microk8s"

root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 10:41 AEDT  Ensure prerequisites for "microk8s" are available
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 10:41 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Mount snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Run pre-refresh hook of "microk8s" snap if present
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Stop snap "microk8s" services
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Remove aliases for snap "microk8s"
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Make current revision for snap "microk8s" unavailable
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Copy snap "microk8s" data
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" (1254) security profiles
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Make snap "microk8s" (1254) available to the system
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Automatically connect eligible plugs and slots of snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Set automatic aliases for snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" aliases
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run post-refresh hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Start snap "microk8s" (1254) services
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Clean up "microk8s" (1254) install
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run configure hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run health check of "microk8s" snap
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Consider re-refresh of "microk8s"

root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  classic
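
The recovery sequence walked through above could be collected into a script along these lines. This is a sketch, not a tested procedure: the names run and recover_stuck_refresh are hypothetical, the 60-second bound is an arbitrary choice, and the script defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# recover_stuck_refresh (hypothetical name): the steps mirror the session above.
recover_stuck_refresh() {
  local change_id="$1"
  run snap abort "$change_id"
  # Stopping snapd hung in the report, so bound the wait and fall back to killall.
  # In dry-run mode the first command "succeeds", so the fallback is not printed.
  run timeout 60 systemctl stop snapd.service || run killall snapd
  run systemctl start snapd.service
  run snap changes
}

# Usage:
#   recover_stuck_refresh 20              # print the plan
#   DRY_RUN=0 recover_stuck_refresh 20    # execute it (as root)
```

Killing snapd mid-change is a last resort. In the report it led to the change being undone cleanly, but that outcome is not guaranteed.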

Nothing much in snapd logs except for a polkit error - unsure if related:

root@wk3:/home# journalctl -b -u snapd.service

...
Mar 09 06:11:34 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 16:11:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 16:11:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 19:06:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 19:06:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 02:51:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 10 02:51:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 09:56:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl"
Mar 10 10:12:18 wk3 snapd[15182]: daemon.go:208: polkit error: Authorization requires interaction
Mar 10 10:39:24 wk3 systemd[1]: Stopping Snappy daemon...
Mar 10 10:39:24 wk3 snapd[15182]: main.go:155: Exiting on terminated signal.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Killing process 15182 (snapd) with signal SIGKILL.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Failed with result 'timeout'.
Mar 10 10:40:54 wk3 systemd[1]: Stopped Snappy daemon.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Found left-over process 16729 (sync) in control group while starting unit. Ignoring.
Mar 10 10:40:54 wk3 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 10 10:40:54 wk3 systemd[1]: Starting Snappy daemon...
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:346: started snapd/2.43.3 (series 16; classic) ubuntu/18.04 (amd64) linux/4.15.0-88-generic.
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:439: adjusting startup timeout by 45s (pessimistic estimate of 30s plus 5s per snap)
Mar 10 10:40:54 wk3 systemd[1]: Started Snappy daemon.
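
Besides the polkit error, the log mentions a left-over sync process in snapd's control group. One possible reading, not confirmed anywhere in the issue, is that the data copy step was waiting on a disk flush. The sketch below pulls such left-over processes out of the journal so they can be inspected. The function name left_over_procs is an assumption, and the pattern matches only the systemd message format shown above.

```shell
#!/usr/bin/env bash
# left_over_procs (hypothetical helper): read journal text on stdin and print
# "PID NAME" for each "Found left-over process" message from systemd.
left_over_procs() {
  sed -n 's/.*Found left-over process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage:
#   journalctl -b -u snapd.service | left_over_procs
```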

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 91 (19 by maintainers)

Top GitHub Comments

4 reactions
vazir commented, Sep 24, 2021

Today I experienced a crash of a PRODUCTION microk8s 3-node “HA” cluster. It just auto-updated to 1.21.5! As a programmer and admin, I cannot comprehend what the people deciding on packaging for crucial services had in mind when they chose a tool as fundamentally broken as snap. Why does Ubuntu use it at all, when it is hardly suitable even for desktop apps and not suitable for services at all? What if some medical staff bought their advertising of “highly available” and people died because it auto-updated? They must drop snap for anything other than desktop apps, or better, drop it altogether and use the .deb format proven over years …

4 reactions
ShadowJonathan commented, Aug 1, 2021

I am not sure why you mention only security and in quotes.

Your point was that security is paramount and absolute, and that it should be the excuse that makes this problem okay. It’s not; it’s an excuse that only exacerbates this problem and the problem of snap for servers in general.

Snaps are fine for user apps; those can deal with being restarted, crashing, and shutting down again and again. Server apps need more delicacy, planning, and oversight. No admin/operator wants the developer to control when, how, and why something updates. They want complete control over their systems, and snap’s auto-updating feature is a complete insult to that.

Any update that breaks the cluster is defeating its own purpose.

I’m glad you agree, then? I’d rather have a cluster which is outdated and vulnerable, and possibly get hacked, if it’s about my own oversight and my own fault (at least then I can tune it to my own schedule and my own system). With auto-update, and even the update window, that control is taken away from me, as now I have to scramble to make sure the eventual update will not fuck with my system, and then do it manually, safely, and in a controlled way to make sure it does not fuck over the data (which it did for me: 1.2TB of scraping data, all corrupted because docker didn’t want to close within 30 seconds, after which it got SIGKILLed).

As a sysadmin, I control a developer’s software, when, where, and how. The developer doesn’t control my system, unless I tell it to. And even then, only on my own conditions.

Snaps violated this principle, and that’s why I’m incredibly displeased with them.

