[Doc] Prometheus alert example does not work
Suggestion / Problem
In the examples section you have suggested a few Prometheus alerts, but I don’t see how the alert named ZookeeperContainerRestartedInTheLast5Minutes could ever fulfill its condition expression.
The expression is as follows:
count(count_over_time(container_last_seen{container="zookeeper"}[5m])) > 2 * count(container_last_seen{container="zookeeper",pod=~".+-zookeeper-[0-9]+"})
I have observed these 2 queries and they seem to always move in lock-step (except one has double the value of the other of course). I don’t see how there could ever be a situation where the left side could be bigger than the right side.
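To poke at the two sides outside Prometheus, here is a toy model in Python. It is only a sketch under assumptions: the 30s scrape interval, the helper names, and the idea that a restarted container shows up as a new series while the old one stops receiving samples are all hypothetical, and the "two series on the left per one on the right" ratio just mirrors the doubling observed above without explaining it.

```python
WINDOW = 300  # seconds, stands in for the [5m] range selector
STEP = 30     # assumed scrape interval in seconds (not taken from the rules file)

def samples(start, end, step=STEP):
    """Timestamps at which a series was scraped, from start up to (not including) end."""
    return list(range(start, end, step))

def count_series_in_window(series, now, window=WINDOW):
    """Rough stand-in for count(count_over_time(m[5m])): series with a sample in (now - window, now]."""
    return sum(any(now - window < t <= now for t in ts) for ts in series.values())

def count_series_now(series, now, step=STEP):
    """Rough stand-in for count(m): series with a sample in the latest scrape interval."""
    return sum(any(now - step < t <= now for t in ts) for ts in series.values())

def build(restart_at=None, pods=3, end=1200):
    """Build hypothetical left/right series; pod 0 optionally restarts at restart_at."""
    left, right = {}, {}
    for p in range(pods):
        if p == 0 and restart_at is not None:
            lifetimes = [("old", 0, restart_at), ("new", restart_at, end)]
        else:
            lifetimes = [("only", 0, end)]
        for name, start, stop in lifetimes:
            ts = samples(start, stop)
            # Assumption: two series per container match the left selector, one matches the right.
            left[(p, name, "a")] = ts
            left[(p, name, "b")] = ts
            right[(p, name)] = ts
    return left, right

def condition(left, right, now):
    """The alert expression in this toy model: left count > 2 * right count."""
    return count_series_in_window(left, now) > 2 * count_series_now(right, now)

steady_left, steady_right = build()
restart_left, restart_right = build(restart_at=600)

print(condition(steady_left, steady_right, 900))    # no restart: both sides in lock-step
print(condition(restart_left, restart_right, 720))  # old series still inside the 5m window
print(condition(restart_left, restart_right, 900))  # old series has left the window
```

In this model the two sides stay in lock-step as long as no series is replaced, and the left side only pulls ahead while the replaced series still has a sample inside the 5m window. Whether real `container_last_seen` series behave this way is exactly the assumption to check.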
I don’t have a suggestion for a fix since I don’t really understand the idea of this alert in the first place.
Documentation Link https://github.com/strimzi/strimzi-kafka-operator/blob/master/examples/metrics/prometheus-install/prometheus-rules.yaml
Issue Analytics
- Created 3 years ago
- Comments:12 (12 by maintainers)
Top GitHub Comments
I also tried to deploy the metrics cluster on OCP 4.x and hit the same issue you had. Some fiddling with the time intervals fixed it for me as well.
One more thought: while I was able to see the condition fulfilled in Prometheus, I still couldn’t get the alert to trigger. I noticed that here: https://github.com/strimzi/strimzi-kafka-operator/blob/master/examples/metrics/prometheus-install/prometheus-rules.yaml#L120 the condition must be fulfilled for at least 5 minutes to trigger the alert. Can that actually happen when you’re only looking back 5 minutes in time? My guess would be no, and then the alert would still never fire.
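A rough way to check that guess is to simulate the pending-to-firing logic in Python. This is a simplified sketch, not the actual rule evaluator: the function name `first_firing_time`, the 30s evaluation interval, and the example time ranges are assumptions. It only encodes the idea that an alert fires once its condition has been true on every evaluation for at least the `for` duration, and that the pending state resets as soon as the condition is false.

```python
def first_firing_time(true_from, true_until, eval_interval=30, for_duration=300, horizon=3600):
    """Return the first evaluation time (seconds) at which the alert would fire, or None.

    The condition is assumed true for true_from <= t < true_until.
    """
    active_since = None
    for now in range(0, horizon, eval_interval):
        if true_from <= now < true_until:
            if active_since is None:
                active_since = now      # alert enters the pending state
            if now - active_since >= for_duration:
                return now              # condition held long enough: firing
        else:
            active_since = None         # condition lapsed: pending state resets
    return None

# Condition true for just under 5 minutes, as a 5m lookback would allow at best.
print(first_firing_time(600, 900))                   # None
# Same condition with a shorter hold time (hypothetical for: 1m).
print(first_firing_time(600, 900, for_duration=60))  # 660
```

Under these assumptions, a condition that can only stay true for roughly the length of the lookback window never satisfies an equally long `for` duration, which would fit with the alert never firing.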