How to mitigate kamus-controller restarts impacting dependent systems?
Describe the bug
This is less of a bug and more a question about how to work around the operational trouble caused by the scheduled hourly kamus-controller restart.
I was asked to make a new issue regarding this.
I understand kamus-controller restarting every 60 minutes is normal.
I’m not sure how to make the following dependent systems behave properly when this scheduled “downtime” occurs:
- fluxcd
- CI/CD
- monitoring
- kamus-cli
I’ve seen kamus-controller cause problems when a new HelmRelease is pushed out.
An example of a dependent system having trouble during these restarts:
{
"caller": "loop.go:108",
"component": "sync-loop",
"err": "collating resources in cluster for sync: conversion webhook for soluto.com/v1alpha2, Kind=KamusSecret failed: Post https://kamus-controller.kamus.svc:443/api/v1/conversion-webhook?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
"ts": "2020-11-04T00:04:18.249326875Z"
}
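For the monitoring side, one option is to flag log entries like the one above so that restart-related webhook timeouts can be told apart from other sync errors. The following is only a minimal sketch: the function name is hypothetical, and it assumes the logs arrive as one JSON object per line with an "err" field shaped like the example.

```python
import json


def is_conversion_webhook_timeout(log_line: str) -> bool:
    """Return True if a JSON log line looks like the KamusSecret webhook timeout above."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return False
    if not isinstance(entry, dict):
        return False
    err = entry.get("err")
    if not isinstance(err, str):
        return False
    # Match on the stable fragments of the message, not the full text.
    return (
        "conversion webhook" in err
        and "Kind=KamusSecret" in err
        and "Client.Timeout exceeded" in err
    )


# The log entry from this issue, flattened onto one line.
sample = '{"caller": "loop.go:108", "component": "sync-loop", "err": "collating resources in cluster for sync: conversion webhook for soluto.com/v1alpha2, Kind=KamusSecret failed: Post https://kamus-controller.kamus.svc:443/api/v1/conversion-webhook?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)", "ts": "2020-11-04T00:04:18.249326875Z"}'

if __name__ == "__main__":
    print(is_conversion_webhook_timeout(sample))
```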
I’ve seen two types of failure modes.
kamus-init-container failure:
- New/modified HR: flux notices and updates the HR.
- Helm-operator notices and tries to update using the kamus-init-container.
- The endpoint fails, which gives the log above; this causes the CM (or whatever you are creating with kamus-init-container) to fail.
- Since Kamus couldn’t produce the secret, the HelmRelease fails.
KamusSecret failure:
- New/modified HR: flux notices and updates the HR and the KamusSecret.
- kamus-controller restarts, which causes a 1-3 minute delay in performing the conversion.
- Something in the HR depends on the corresponding output “secret” object; this delay causes an ordered dependency update failure.
- The dependent resource isn’t smart or aware enough to retry, and the helm hooks aren’t configured properly to handle this.
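Where the dependent step can be wrapped in a script (a hook job or a CI/CD step, for example), the missing retry could look roughly like the sketch below. This is an illustration, not something Kamus, flux, or helm provides: `wait_until` and its parameters are hypothetical, and `check` stands in for whatever test tells you the output secret exists. The default timeout of 300 seconds is an assumption chosen to cover the 1-3 minute delay described above.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() with exponential backoff until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline; double the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)


if __name__ == "__main__":
    # Hypothetical check: replace with a real lookup of the output secret.
    attempts = {"n": 0}

    def secret_exists() -> bool:
        attempts["n"] += 1
        return attempts["n"] >= 3

    print(wait_until(secret_exists, initial_delay_s=0.01))
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.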
This is a mixture of three problems (flux, kamus, helm), so I don’t fault any one of them. My only thought is to add more replicas of the kamus-controller, but that only reduces the failure rate (if it is even recommended). I’m also not sure whether a PodDisruptionBudget with 2 replicas would matter if the pod itself is causing the restart.
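To put a rough number on "reducing the failure rate", here is a back-of-the-envelope sketch. It rests on strong assumptions that are not from Kamus itself: each replica is unavailable for a fixed window once per 60-minute cycle, and the replicas restart at independent, unrelated times. If all replicas were started together and restart together, extra replicas would not help at all.

```python
def fraction_all_down(downtime_min: float, period_min: float = 60.0, replicas: int = 1) -> float:
    """Fraction of time every replica is down at once, assuming independent restart times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0 <= downtime_min <= period_min:
        raise ValueError("downtime_min must be between 0 and period_min")
    # Each replica is down downtime/period of the time; independence multiplies the odds.
    return (downtime_min / period_min) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, fraction_all_down(3.0, 60.0, n))
```

Under these assumptions, a 3-minute window gives 5% unavailability with one replica and 0.25% with two, which matches the intuition that more replicas lower the failure rate without removing it.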
Versions used
I can include my versions if desired, since this is a question about how to work around the design of kamus-controller restarts.
Thank you for going above and beyond, I was honestly just hoping for some best practices. I’ll test this out Monday when I return.
Hi @wimo7083. Version 0.9.0.5 was just released (chart version 0.9.5). It was tested on my side and it was found restart free 😃
Please notice that KamusSecret v1alpha1 was dropped at version 0.9, so in case you use it, please convert to v1alpha2 per the changelog documentation.
Please reopen if you still see that issue.