How to mitigate kamus-controller restarts impacting dependent systems?

Describe the bug

This is less of a bug and more a question about how to work around the operational trouble caused by the scheduled hourly kamus-controller restart.

I was asked to make a new issue regarding this.

I understand kamus-controller restarting every 60 minutes is normal.

I’m not sure how to make dependent systems behave properly when this scheduled “downtime” occurs.

fluxcd
CI/CD
monitoring
kamus-cli

I’ve seen kamus-controller cause problems when a new HelmRelease is pushed out.

An example of a dependent system having trouble during these restarts:

{
  "caller": "loop.go:108",
  "component": "sync-loop",
  "err": "collating resources in cluster for sync: conversion webhook for soluto.com/v1alpha2, Kind=KamusSecret failed: Post https://kamus-controller.kamus.svc:443/api/v1/conversion-webhook?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
  "ts": "2020-11-04T00:04:18.249326875Z"
}
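For the monitoring side, an entry like the one above can be recognized programmatically. The following is a minimal Python sketch, assuming the logs arrive as one JSON object per line with an "err" field as shown. The function name and the two marker substrings are illustrative choices taken from this single log entry, not anything defined by kamus or flux.

```python
import json

# Substrings copied from the log entry above. Treating them as a stable
# signature of this failure is an assumption based on one sample.
WEBHOOK_MARKER = "conversion webhook for soluto.com/v1alpha2, Kind=KamusSecret failed"
TIMEOUT_MARKER = "Client.Timeout exceeded"


def is_kamus_webhook_timeout(log_line: str) -> bool:
    """Return True if a JSON log line looks like the conversion-webhook timeout."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return False
    if not isinstance(entry, dict):
        return False
    err = entry.get("err", "")
    if not isinstance(err, str):
        return False
    return WEBHOOK_MARKER in err and TIMEOUT_MARKER in err


# The sample mirrors the fields of the log entry quoted in the issue.
sample_line = json.dumps({
    "caller": "loop.go:108",
    "component": "sync-loop",
    "err": (
        "collating resources in cluster for sync: conversion webhook for "
        "soluto.com/v1alpha2, Kind=KamusSecret failed: Post "
        "https://kamus-controller.kamus.svc:443/api/v1/conversion-webhook?timeout=30s: "
        "net/http: request canceled while waiting for connection "
        "(Client.Timeout exceeded while awaiting headers)"
    ),
    "ts": "2020-11-04T00:04:18.249326875Z",
})

if __name__ == "__main__":
    print(is_kamus_webhook_timeout(sample_line))
```

A check like this could let an alerting rule treat these entries as expected noise during the restart window rather than as a sync failure, though that is a policy decision for whoever owns the monitoring.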

I’ve seen two failure modes.

kamus-init-container failure:

New/modified HR: flux notices and updates the HR.
helm-operator notices and tries to update using the kamus-init-container.
The endpoint fails, which gives the log above; this causes the CM (or whatever you are creating with kamus-init-container) to fail.
Since Kamus couldn’t produce the secret, the HelmRelease fails.

KamusSecret failure:

New/modified HR: flux notices and updates the HR and the KamusSecret.
kamus-controller restarts, which causes a 1-3 minute delay in performing the conversion.
Something in the HR depends on the corresponding output “secret” object; this delay causes an ordered dependency update failure.
The dependent resource isn’t smart enough/aware enough to retry, and the helm hooks aren’t configured properly to handle this.
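The missing retry in the dependent resource could look something like the sketch below. It is a generic retry-with-backoff helper in Python, not anything provided by kamus, flux, or helm. The function name, the parameters, and the 300-second default budget are assumptions, with the budget chosen only so that it exceeds the 1-3 minute delay described above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_wait_seconds: float = 300.0,  # assumed budget, longer than the 1-3 minute delay
    initial_delay: float = 5.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or the total wait budget is spent."""
    waited = 0.0
    delay = initial_delay
    while True:
        try:
            return operation()
        except Exception:
            # Broad catch for brevity; a real caller would catch only the
            # errors that indicate the secret is not available yet.
            if waited >= max_wait_seconds:
                raise
            pause = min(delay, max_wait_seconds - waited)
            sleep(pause)
            waited += pause
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
```

Here `operation` would be whatever hypothetical call reads the output secret. The helper only helps where the dependent workload can be changed to wait; it does nothing for components that fail once and give up.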

This is a mixture of 3 problems (flux, kamus, helm), so I don’t fault any one of them. My only thought is to add more replicas of the kamus-controller, but that only reduces the failure rate (if it is even recommended). I’m also not sure whether a PodDisruptionBudget with 2 replicas would matter if the pod itself is causing the restart.
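To put a rough number on "reducing the failure rate", here is a back-of-the-envelope Python sketch. It assumes each replica is unavailable for a fixed window once per hour and that the replicas restart at independent, uniformly random times. That independence is an assumption and may not hold: replicas started together would likely restart together, in which case extra replicas give no benefit.

```python
def fraction_all_down(
    replicas: int,
    downtime_minutes: float,
    period_minutes: float = 60.0,
) -> float:
    """Fraction of time every replica is down at once, assuming independent restarts."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime must be between 0 and the period")
    per_replica = downtime_minutes / period_minutes
    return per_replica ** replicas


if __name__ == "__main__":
    # Worst case from the issue: a 3-minute window every 60 minutes.
    for n in (1, 2, 3):
        print(n, f"{fraction_all_down(n, 3.0):.6f}")
```

Under these assumptions, a 3-minute window gives roughly 5% of the time with no controller available for one replica, 0.25% for two, and 0.0125% for three. The rate drops quickly but never reaches zero, which matches the concern above.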

Versions used

I can include my versions if desired, as this is a question about how to work around the design of kamus-controller restarts.

Issue Analytics

Created 3 years ago
Comments: 9 (5 by maintainers)

Top GitHub Comments

apex-omontgomery commented, Dec 8, 2020

Thank you for going above and beyond, I was honestly just hoping for some best practices. I’ll test this out Monday when I return.

shaikatz commented, Feb 15, 2021

Hi @wimo7083. Version 0.9.0.5 was just released (chart version 0.9.5). I tested it on my side and found it restart-free 😃

Please note that KamusSecret v1alpha1 was dropped in version 0.9, so if you use it, please convert to v1alpha2 per the changelog documentation.

Please reopen if you still see that issue.