[kubernetes] Misleading metric name
There’s a metric in the Kubernetes check called kubernetes.pods.running. This metric is produced here, and it receives the pod list from here, i.e. it gets the pod list locally from the kubelet, which makes sense.
However, the kubelet’s local pod list is a list of all pods scheduled to that node, not only those literally running on it, which makes the name of this metric somewhat misleading IMO. For example, this list will include job pods, which will have phase Succeeded (shown as Completed in kubectl get pods output), pods which have been scheduled to the node but not yet started (phase Pending, e.g. while the node is mounting the volumes needed for the pod), etc.
For example, here’s the count of the different pod phases on a node of ours:
» curl -s http://foo-node:10255/pods/ | jq ".items[].status.phase" | sort | uniq -c
1 "Failed"
1 "Pending"
49 "Running"
5 "Succeeded"
We were confused because Datadog reported more running pods than the “Running” count above, and more than kubectl describe node shows. This turned out to be the reason.
I think there are 3 sensible approaches:
- Rename this metric to something like kubernetes.pods.scheduled. This would be more truthful to what it’s actually reporting. The code logic doesn’t need to change at all, but the purpose of the metric is more obvious.
- Introduce a new metric, which does filter the list of pods to only those with .status.phase == "Running".
- Add the phase of the pod to the current metric or to a new metric as a tag. Better granularity is good!
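
To make the second and third approaches concrete, here is a minimal Python sketch of counting pods per phase from an already-parsed kubelet pod list. It is not the check's actual code: the function names and the metric name kubernetes.pods.count are hypothetical, and the sample data only mirrors the phase counts from the node shown earlier.

```python
from collections import Counter


def count_pods_by_phase(pod_list):
    """Count pods per .status.phase in a parsed kubelet pod list (a dict)."""
    phases = (
        pod.get("status", {}).get("phase", "Unknown")
        for pod in pod_list.get("items") or []
    )
    return Counter(phases)


def phase_tagged_gauges(pod_list, metric="kubernetes.pods.count"):
    """Yield (metric, value, tags) tuples, one per phase (third approach).

    The metric name is a hypothetical placeholder, not an existing metric.
    """
    for phase, count in sorted(count_pods_by_phase(pod_list).items()):
        yield metric, count, ["phase:" + phase.lower()]


# Sample data mirroring the phase counts from the node above;
# a real pod list would come from the kubelet.
sample = {
    "items": [
        {"status": {"phase": phase}}
        for phase in ["Failed", "Pending"] + ["Running"] * 49 + ["Succeeded"] * 5
    ]
}

counts = count_pods_by_phase(sample)
print(counts["Running"])     # second approach: only pods in phase Running -> 49
print(sum(counts.values()))  # every pod in the list, regardless of phase -> 56
```

With a phase tag, the count of running pods and the count of all scheduled pods can both be derived from a single metric, so the two numbers no longer need separate names.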
I also tried looking in agent6 code to see if it’s any different there (because afaik that uses internal Go code to report k8s metrics rather than using the Python checks) but my understanding of Go is not sufficient to find where agent6 is collecting this.
Issue Analytics
- State:
- Created 6 years ago
- Reactions:3
- Comments:6 (2 by maintainers)
Top GitHub Comments
As long as I can get the real number of running pods I’m happy. If I get all the different phases too, even better!
I see this came up (again) in #2597