snap auto-refresh breaks cluster
This morning a close-to-production cluster fell over after snap’s auto-refresh “feature” failed on 3 of 4 worker nodes. It looks like it hung at the Copy snap "microk8s" data step. microk8s could be restarted after aborting the auto-refresh, but only after manually killing snapd.
For a production-ready Kubernetes distribution I really think this is far from an acceptable default. Perhaps, until snapd allows disabling auto-refreshes, the microk8s scripts could recommend running sudo snap set system refresh.hold=2050-01-01T15:04:05Z or similar.
A Kubernetes-native integration with snapd refreshes could also be considered (e.g. a Prometheus/Grafana dashboard or alert) to prompt manual updates, presumably one node at a time to begin with.
Otherwise microk8s is working rather well so thank you very much.
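For anyone who wants to script the hold suggested above, here is a minimal sketch in Python. It builds the snap set system refresh.hold=... command for a date a fixed number of days ahead, so the hold can be re-applied from a scheduled job instead of relying on one far-future date. The function name refresh_hold_command and the 60-day default are my own choices, not anything from snapd or microk8s, and how far ahead snapd actually honours a hold is an assumption to check against the documentation for your snapd version.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def refresh_hold_command(now: datetime, days: int = 60) -> List[str]:
    """Return the argv that holds snap refreshes until `days` after `now`.

    Hypothetical helper: only the `snap set system refresh.hold=<time>`
    form comes from the issue text; everything else is illustrative.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    # Normalise to UTC and drop sub-second precision for a clean timestamp.
    until = (now.astimezone(timezone.utc) + timedelta(days=days)).replace(microsecond=0)
    stamp = until.strftime("%Y-%m-%dT%H:%M:%SZ")
    return ["snap", "set", "system", "refresh.hold=" + stamp]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be reviewed first.
    print(" ".join(refresh_hold_command(datetime.now(timezone.utc))))
```

The command is returned as a list so it can be handed to subprocess.run under sudo without shell quoting, or simply printed and pasted.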
More details about the outage:
kubectl get nodes
NAME           STATUS     ROLES    AGE   VERSION
10.aa.aa.aaa   Ready      <none>   38d   v1.17.3
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   38d   v1.17.2
10.aa.aa.aaa   NotReady   <none>   18d   v1.17.2
aaa-master     Ready      <none>   59d   v1.17.3
microk8s is disabled…
root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  disabled,classic
root@wk3:/home# snap changes microk8s
ID  Status  Spawn                Ready  Summary
20  Doing   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"
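A change stuck like this one could be spotted automatically. The following is a rough sketch in Python, not a finished monitoring tool: it parses the text printed by snap changes --abs-time (which shows absolute RFC 3339 timestamps instead of "today at ...") and returns the changes that have been in Doing or Abort for longer than a threshold. The names Change and stuck_changes, the 15-minute threshold, and the exact column layout of the sample are assumptions; the sample row is change 20 from above rewritten with an absolute timestamp.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Sequence


class Change(NamedTuple):
    change_id: str
    status: str
    spawn: datetime
    summary: str


def parse_time(stamp: str) -> datetime:
    """Parse an RFC 3339 timestamp; map a trailing 'Z' to an explicit offset."""
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    return datetime.fromisoformat(stamp)


def stuck_changes(output: str, now: datetime, max_age: timedelta,
                  statuses: Sequence[str] = ("Doing", "Abort")) -> List[Change]:
    """Return changes whose status is in `statuses` and older than `max_age`."""
    stuck = []
    for line in output.splitlines():
        # Columns: ID, Status, Spawn, Ready, Summary (the summary has spaces).
        parts = line.split(None, 4)
        if len(parts) < 5 or parts[0] == "ID":
            continue  # skip the header and blank or short lines
        change_id, status, spawn, _ready, summary = parts
        if status not in statuses:
            continue
        spawned = parse_time(spawn)
        if now - spawned > max_age:
            stuck.append(Change(change_id, status, spawned, summary))
    return stuck


# Illustrative input modelled on change 20 in this issue (assumed layout).
SAMPLE = (
    "ID  Status  Spawn                      Ready  Summary\n"
    '20  Doing   2020-03-10T09:56:31+11:00  -      Auto-refresh snap "microk8s"\n'
)

if __name__ == "__main__":
    now = parse_time("2020-03-10T10:30:00+11:00")
    for change in stuck_changes(SAMPLE, now, timedelta(minutes=15)):
        print(change.change_id, change.status, change.summary)
```

The result could feed an alert of the kind suggested earlier in this issue. Passing now as a parameter keeps the check easy to test against a saved transcript.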
Data copy appears to have hung
root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 09:56 AEDT  Ensure prerequisites for "microk8s" are available
Done    today at 09:56 AEDT  today at 09:56 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 09:56 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Mount snap "microk8s" (1254)
Done    today at 09:56 AEDT  today at 09:56 AEDT  Run pre-refresh hook of "microk8s" snap if present
Done    today at 09:56 AEDT  today at 09:57 AEDT  Stop snap "microk8s" services
Done    today at 09:56 AEDT  today at 09:57 AEDT  Remove aliases for snap "microk8s"
Done    today at 09:56 AEDT  today at 09:57 AEDT  Make current revision for snap "microk8s" unavailable
Doing   today at 09:56 AEDT  -                    Copy snap "microk8s" data
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" (1254) security profiles
Do      today at 09:56 AEDT  -                    Make snap "microk8s" (1254) available to the system
Do      today at 09:56 AEDT  -                    Automatically connect eligible plugs and slots of snap "microk8s"
Do      today at 09:56 AEDT  -                    Set automatic aliases for snap "microk8s"
Do      today at 09:56 AEDT  -                    Setup snap "microk8s" aliases
Do      today at 09:56 AEDT  -                    Run post-refresh hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Start snap "microk8s" (1254) services
Do      today at 09:56 AEDT  -                    Clean up "microk8s" (1254) install
Do      today at 09:56 AEDT  -                    Run configure hook of "microk8s" snap if present
Do      today at 09:56 AEDT  -                    Run health check of "microk8s" snap
Doing   today at 09:56 AEDT  -                    Consider re-refresh of "microk8s"
There doesn’t seem to be much to copy anyway:
root@wk3 /v/l/snapd# du -sh /var/lib/snapd/ /var/snap/ /snap
527M /var/lib/snapd/
74G /var/snap/
2.0G /snap
root@wk3 /s/microk8s# du -sh /snap/microk8s/*
737M /snap/microk8s/1176
737M /snap/microk8s/1254
root@wk3 /s/microk8s# du -sh /var/snap/microk8s/*
232K /var/snap/microk8s/1176
74G /var/snap/microk8s/common
Starting microk8s fails
user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress
root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress
Fails to abort…
root@wk3:/home# snap abort 20
root@wk3:/home# snap changes
ID  Status  Spawn                Ready  Summary
20  Abort   today at 09:56 AEDT  -      Auto-refresh snap "microk8s"
user@wk3 /s/m/1254> sudo snap start microk8s
error: snap "microk8s" has "auto-refresh" change in progress
root@wk3:/home# snap enable microk8s
error: snap "microk8s" has "auto-refresh" change in progress
snapd service hangs when trying to stop it…
root@wk2 ~# systemctl stop snapd.service
(hangs)
Have to resort to manually killing the process:
killall snapd
Finally the change is undone…
root@wk3:/home# snap changes
ID  Status  Spawn                Ready                Summary
20  Undone  today at 09:56 AEDT  today at 10:41 AEDT  Auto-refresh snap "microk8s"
root@wk3:/home# snap tasks --last=auto-refresh
Status  Spawn                Ready                Summary
Done    today at 09:56 AEDT  today at 10:41 AEDT  Ensure prerequisites for "microk8s" are available
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Download snap "microk8s" (1254) from channel "1.17/stable"
Done    today at 09:56 AEDT  today at 10:41 AEDT  Fetch and check assertions for snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Mount snap "microk8s" (1254)
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Run pre-refresh hook of "microk8s" snap if present
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Stop snap "microk8s" services
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Remove aliases for snap "microk8s"
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Make current revision for snap "microk8s" unavailable
Undone  today at 09:56 AEDT  today at 10:41 AEDT  Copy snap "microk8s" data
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" (1254) security profiles
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Make snap "microk8s" (1254) available to the system
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Automatically connect eligible plugs and slots of snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Set automatic aliases for snap "microk8s"
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Setup snap "microk8s" aliases
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run post-refresh hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Start snap "microk8s" (1254) services
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Clean up "microk8s" (1254) install
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run configure hook of "microk8s" snap if present
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Run health check of "microk8s" snap
Hold    today at 09:56 AEDT  today at 10:30 AEDT  Consider re-refresh of "microk8s"
root@wk3:/home# snap list
Name      Version    Rev   Tracking  Publisher   Notes
core      16-2.43.3  8689  stable    canonical✓  core
kubectl   1.17.3     1424  1.17      canonical✓  classic
microk8s  v1.17.2    1176  1.17      canonical✓  classic
Nothing much in snapd logs except for a polkit error - unsure if related:
root@wk3:/home# journalctl -b -u snapd.service
...
Mar 09 06:11:34 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 16:11:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 16:11:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 09 19:06:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 09 19:06:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 02:51:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl", "microk8s"
Mar 10 02:51:31 wk3 snapd[15182]: autorefresh.go:397: auto-refresh: all snaps are up-to-date
Mar 10 09:56:31 wk3 snapd[15182]: storehelpers.go:436: cannot refresh: snap has no updates available: "core", "kubectl"
Mar 10 10:12:18 wk3 snapd[15182]: daemon.go:208: polkit error: Authorization requires interaction
Mar 10 10:39:24 wk3 systemd[1]: Stopping Snappy daemon...
Mar 10 10:39:24 wk3 snapd[15182]: main.go:155: Exiting on terminated signal.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: State 'stop-sigterm' timed out. Killing.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Killing process 15182 (snapd) with signal SIGKILL.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Main process exited, code=killed, status=9/KILL
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Failed with result 'timeout'.
Mar 10 10:40:54 wk3 systemd[1]: Stopped Snappy daemon.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Triggering OnFailure= dependencies.
Mar 10 10:40:54 wk3 systemd[1]: snapd.service: Found left-over process 16729 (sync) in control group while starting unit. Ignoring.
Mar 10 10:40:54 wk3 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Mar 10 10:40:54 wk3 systemd[1]: Starting Snappy daemon...
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: AppArmor status: apparmor is enabled and all features are available
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:346: started snapd/2.43.3 (series 16; classic) ubuntu/18.04 (amd64) linux/4.15.0-88-generic.
Mar 10 10:40:54 wk3 snapd[18170]: daemon.go:439: adjusting startup timeout by 45s (pessimistic estimate of 30s plus 5s per snap)
Mar 10 10:40:54 wk3 systemd[1]: Started Snappy daemon.
Issue Analytics
- Created 4 years ago
- Reactions:1
- Comments:91 (19 by maintainers)
Top GitHub Comments
Today I experienced a crash of a PRODUCTION microk8s 3-node “HA” cluster. It just auto-updated to 1.21.5! As a programmer and admin, I cannot comprehend what the people deciding on packaging for crucial services had in mind when they chose a tool as fundamentally broken as snap. Why does Ubuntu use it at all, when it is hardly suitable even for desktop apps and not suitable for services at all? What if some medical staff bought into the advertising of it as “highly available” and people died because it auto-updated? They must drop snap for anything other than desktop apps, or better, drop it entirely and use .deb, which has been proven over years…
Your point was that security is paramount and absolute, and that this should be the excuse that makes the problem okay. It’s not; it’s an excuse that only exacerbates this problem and the whole of snap for servers in general.
Snaps are fine for user apps; those can deal with being restarted, crashing, and shutting down, again and again. Server apps need more delicacy, planning, and oversight. No admin/operator wants the developer to control when, how, and why something updates. They want complete control over their systems, and snap’s auto-updating feature is a complete insult to that.
I’m glad you agree, then? I’d rather have a cluster which is outdated and vulnerable, and possibly get hacked, if it’s down to my own oversight and my own fault (at least then I can tune it to my own schedule and my own system). With auto-update, and even the update window, that control is taken away from me, as now I have to scramble to make sure the eventual update will not fuck with my system, and then do it manually, safely, and in a controlled way to make sure it does not fuck over the data (which it did for me: 1.2TB of scraping data, all corrupted because docker didn’t want to close within 30 seconds, after which it got SIGKILLed).
As a sysadmin, I control a developer’s software, when, where, and how. The developer doesn’t control my system, unless I tell it to. And even then, only on my own conditions.
Snaps violated this principle, and that’s why I’m incredibly displeased with them.