kubernetes.pods.running reporting incorrectly
See original GitHub issue

Output of the info page
==============
Agent (v6.4.2)
==============
Status date: 2018-08-24 00:05:55.602398 UTC
Pid: 352
Python Version: 2.7.15
Logs:
Check Runners: 2
Log Level: WARNING
kubernetes_apiserver
--------------------
Total Runs: 53293
Metric Samples: 0, Total: 0
Events: 0, Total: 0
Service Checks: 0, Total: 0
Average Execution Time : 4ms
(a ton of unrelated and possibly sensitive stuff removed)
Additional environment details (Operating System, Cloud provider, etc): GKE - kubernetes 1.10
Steps to reproduce the issue:
Have a k8s cluster monitored by Datadog where at least one pod is in a failed state (or any state that is not Running).
Describe the results you received: The metric appears to count pods in all phases, including Failed.
Describe the results you expected:
Simple fix: the metric is correctly filtered to only pods where status.phase == Running.
Enhancement: the metric is replaced by kubernetes.pods.count with status.phase added as a tag, allowing accurate reporting of pods in e.g. the Failed state. This would enable more useful metrics and reporting.
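The two suggestions above can be sketched as follows. This is a hypothetical illustration, not the agent's actual code: the pod dicts and function names are made up, standing in for the pod objects returned by the Kubernetes API.

```python
# Sketch of the proposed fix (hypothetical helper names; pods are
# plain dicts mimicking the shape of Kubernetes pod API objects).
from collections import Counter

def count_running_pods(pods):
    """Simple fix: count only pods whose status.phase is "Running"."""
    return sum(1 for pod in pods if pod["status"]["phase"] == "Running")

def count_pods_by_phase(pods):
    """Enhancement: one count per phase, suitable for emitting a
    kubernetes.pods.count metric tagged with status.phase."""
    return Counter(pod["status"]["phase"] for pod in pods)

pods = [
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Failed"}},
    {"status": {"phase": "Running"}},
]
print(count_running_pods(pods))   # 2
print(count_pods_by_phase(pods))  # Counter({'Running': 2, 'Failed': 1})
```

With per-phase counts, a Failed pod would show up as its own tagged series instead of silently inflating the "running" number.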
Note that a similar metric is exposed in kubernetes_state when it's configured, but that shouldn't excuse the inaccuracy of the other one.
Additional information you deem important (e.g. issue happens only occasionally):
Issue Analytics
- Created 5 years ago
- Reactions: 10
- Comments: 14 (1 by maintainers)
Top GitHub Comments
I’d also like to add that we consistently see inaccurate measurements for the pods-running metric: the numbers are off by 100% during scaling periods and can take up to 10 minutes to stabilize. Turning off interpolation in the metric graphs shows a sawtooth pattern.
We face this issue too. kubernetes.pods.running shows only a single pod most of the time. Sometimes it changes to a floating-point number (up to 1.4) even when there are definitely several pods running.