[kubernetes] Misleading metric name

There’s a metric in the Kubernetes check called kubernetes.pods.running. This metric is produced here, and it receives the pod list from here, i.e. it gets the pod list locally from the kubelet, which makes sense.

However, the kubelet’s local pod list contains all pods scheduled to that node, not only those literally running on it, which makes the name of this metric somewhat misleading IMO. The list includes job pods with phase Succeeded (shown as Completed in kubectl get pods output), pods which have been scheduled to the node but not yet started (phase Pending, e.g. while the node is mounting the volumes needed for the pod), etc.

For example, here’s the count of the different pod phases on a node of ours:

» curl -s http://foo-node:10255/pods/ | jq ".items[].status.phase" | sort | uniq -c
   1 "Failed"
   1 "Pending"
  49 "Running"
   5 "Succeeded"
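
The same per-phase tally can be sketched in Python, which is closer to what a check would do with the pod list. This is only an illustration: the function name `count_pod_phases` and the trimmed sample payload are made up, and the only thing taken from the kubelet response is the `items[].status.phase` shape that the jq filter above relies on.

```python
from collections import Counter


def count_pod_phases(pod_list: dict) -> Counter:
    """Count pods per .status.phase in a kubelet-style pod list.

    Hypothetical helper: pods without a phase are counted as "Unknown".
    """
    return Counter(
        pod.get("status", {}).get("phase", "Unknown")
        for pod in pod_list.get("items") or []
    )


# Made-up, heavily trimmed payload that mirrors only the fields used here.
sample = {
    "items": [
        {"metadata": {"name": "web-1"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "web-2"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "job-1"}, "status": {"phase": "Succeeded"}},
        {"metadata": {"name": "db-1"}, "status": {"phase": "Pending"}},
    ]
}

phases = count_pod_phases(sample)
for phase, count in sorted(phases.items()):
    print(f"{count:4d} {phase}")
```

Counting the length of `items` directly, without looking at the phase, is what produces the larger "scheduled" number described in this issue.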

We were confused to see Datadog report more running pods than the “Running” count above or the number shown by kubectl describe node, and this turned out to be why.

I think there are 3 sensible approaches:

  • Rename this metric to something like kubernetes.pods.scheduled. This would be more truthful to what it’s actually reporting. The code logic wouldn’t need to change at all, but the purpose of the metric would be more obvious.
  • Introduce a new metric that filters the list of pods to only those with .status.phase == "Running".
  • Add the phase of the pod as a tag, either to the current metric or to a new one. Better granularity is good!
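
As a rough sketch of the second and third approaches, assuming the check already holds the kubelet pod list as a list of dicts: the helper names, the `phase:<value>` tag format, and the tuple shape used for the emitted gauges are all assumptions for illustration, not the actual check code.

```python
from collections import Counter
from typing import Dict, List, Tuple

Gauge = Tuple[str, int, List[str]]  # (metric name, value, tags); hypothetical shape


def running_pod_count(pods: List[Dict]) -> int:
    """Approach 2: count only pods whose .status.phase is "Running"."""
    return sum(
        1 for pod in pods if pod.get("status", {}).get("phase") == "Running"
    )


def phase_tagged_gauges(pods: List[Dict], metric_name: str) -> List[Gauge]:
    """Approach 3: one gauge per phase, with the phase attached as a tag."""
    counts = Counter(
        pod.get("status", {}).get("phase", "Unknown") for pod in pods
    )
    return [
        (metric_name, count, [f"phase:{phase.lower()}"])  # assumed tag format
        for phase, count in sorted(counts.items())
    ]


# Made-up pod list with only the field this sketch reads.
pods = [
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Pending"}},
    {"status": {"phase": "Succeeded"}},
]

print(running_pod_count(pods))
for gauge in phase_tagged_gauges(pods, "kubernetes.pods.scheduled"):
    print(gauge)
```

With a phase tag in place, the real number of running pods and the total scheduled count could both be derived from one metric by filtering or summing over the tag.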

I also tried looking in agent6 code to see if it’s any different there (because afaik that uses internal Go code to report k8s metrics rather than using the Python checks) but my understanding of Go is not sufficient to find where agent6 is collecting this.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 3
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

andor44 commented, May 31, 2018 (1 reaction)

As long as I can get the real number of running pods I’m happy. If I get all the different phases too, even better!

andor44 commented, Nov 22, 2018 (0 reactions)

I see this came up (again) in #2597
