How to mitigate kamus-controller restarts impacting dependent systems?

Describe the bug: This is less of a bug and more a question about how to work around the operational trouble caused by the scheduled hourly kamus-controller restart.

I was asked to make a new issue regarding this.

I understand kamus-controller restarting every 60 minutes is normal.

I’m not sure how to make dependent systems behave properly when this scheduled “downtime” occurs.

  • fluxcd
  • CI/CD
  • monitoring
  • kamus-cli

I’ve seen kamus-controller cause problems when a new HelmRelease is pushed out.

An example of a dependent system having trouble during these restarts:

{
  "caller": "loop.go:108",
  "component": "sync-loop",
  "err": "collating resources in cluster for sync: conversion webhook for soluto.com/v1alpha2, Kind=KamusSecret failed: Post https://kamus-controller.kamus.svc:443/api/v1/conversion-webhook?timeout=30s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
  "ts": "2020-11-04T00:04:18.249326875Z"
}
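For anyone wanting to alert on or filter these entries, here is a minimal sketch of how a log line like the one above could be classified as a conversion webhook timeout. The function name and the choice of marker substrings are assumptions for illustration (the markers are copied from the log above); this is not part of flux or kamus.

```python
import json

# Substrings copied from the log entry above. Treating their presence as the
# sign of a transient failure is an assumption of this sketch.
TRANSIENT_MARKERS = (
    "conversion webhook for",
    "Client.Timeout exceeded",
)


def is_transient_webhook_timeout(log_line: str) -> bool:
    """Return True if a JSON log line looks like a conversion webhook timeout."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return False
    if not isinstance(entry, dict):
        return False
    err = entry.get("err", "")
    if not isinstance(err, str):
        return False
    # Every marker must be present for the line to count as a match.
    return all(marker in err for marker in TRANSIENT_MARKERS)
```

A monitoring rule could use a check like this to downgrade such errors to a warning while the controller is known to be restarting, instead of paging on every occurrence.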

I’ve seen two failure modes.

kamus-init-container failure:

  1. New/modified HR: flux notices and updates the HR
  2. Helm-operator notices and tries to update using the kamus-init-container
  3. The endpoint fails, which gives the log above; this causes the CM (or whatever you are creating with kamus-init-container) to fail
  4. Since Kamus couldn’t decrypt the secret, the HelmRelease fails

KamusSecret failure:

  1. New/modified HR: flux notices and updates the HR and the KamusSecret
  2. kamus-controller restarts, which causes a 1-3 minute delay in performing the conversion
  3. Something in the HR depends on the corresponding output “secret” object, so this delay causes an ordered-dependency update failure
  4. The dependent resource isn’t smart or aware enough to retry, and the helm hooks aren’t configured properly to handle this
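The missing retry in step 4 could be sketched roughly as below: a dependent job polls for the output secret with exponential backoff instead of failing on the first miss. Everything here is hypothetical: `wait_for`, the `fetch` callable (which would wrap whatever lookup the dependent system actually performs), and the default timings. The 300-second default is only an assumption chosen to exceed the 1-3 minute delay described above.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(
    fetch: Callable[[], Optional[T]],
    timeout_s: float = 300.0,       # assumption: longer than the 1-3 minute delay
    initial_delay_s: float = 2.0,   # assumption: illustrative starting delay
    max_delay_s: float = 30.0,      # assumption: illustrative cap on the delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        result = fetch()
        if result is not None:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("resource did not appear before the timeout")
        # Never sleep past the deadline; double the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

In practice `fetch` would return the secret when it exists and `None` otherwise, so a controller restart turns into a short wait rather than a failed release.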

This is a mixture of three problems (flux, kamus, helm), so I don’t fault any one of them. My only thought is to add more replicas of the kamus-controller, but that only reduces the failure rate (if it is even recommended). I’m also not sure a PodDisruptionBudget with 2 replicas would matter if the pod itself is causing the restart.
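To put a rough number on "reducing the failure rate", here is a back-of-the-envelope sketch. It assumes each replica is unavailable for a fixed window once per restart period and, importantly, that replicas restart independently of each other. If all replicas restart on the same schedule at the same moment, that assumption breaks and extra replicas give no benefit. The function name is made up for illustration.

```python
def simultaneous_outage_fraction(
    downtime_min: float, period_min: float, replicas: int
) -> float:
    """Estimate the fraction of time during which every replica is down.

    Assumes independent restarts (a strong assumption, see the note above).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if period_min <= 0 or not 0 <= downtime_min <= period_min:
        raise ValueError("need 0 <= downtime_min <= period_min and period_min > 0")
    per_replica = downtime_min / period_min
    # Independent events: probability that all replicas are down at once.
    return per_replica ** replicas


# Worst case from the issue: 3 minutes of delay in every 60-minute cycle.
for n in (1, 2, 3):
    print(n, simultaneous_outage_fraction(3, 60, n))
```

Under these assumptions a second replica cuts the exposed window from about 5% of the time to about 0.25%, which matches the intuition that replicas lower the failure rate without eliminating it.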

Versions used: I can include my versions if desired, but this is a question about how to work around the design of kamus-controller restarts.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

1 reaction
apex-omontgomery commented, Dec 8, 2020

Thank you for going above and beyond. I was honestly just hoping for some best practices. I’ll test this out Monday when I return.

0 reactions
shaikatz commented, Feb 15, 2021

Hi @wimo7083. Version 0.9.0.5 was just released (chart version 0.9.5). I tested it on my side and found it to be restart-free 😃

Please note that KamusSecret v1alpha1 was dropped in version 0.9, so if you use it, please convert to v1alpha2 per the changelog documentation.

Please reopen if you still see that issue.

Top Results From Across the Web

Kamus controller always restarting · Issue #517
Kamus controller not restarting and be in a healthy state. ... How to mitigate kamus-controller restarts impacting dependent systems? #598.