Cycling Kubernetes pods and scraping Prometheus metrics causing latency issues

See original GitHub issue

Issue Type:

  • Bug report
  • Feature request

What happened:

We deploy Linkerd as a daemonset (to around 500 nodes) on Kubernetes, and we’re scraping metrics using Prometheus.

We cycled every pod in the cluster (around 3400 in total) whilst doing some routine maintenance. We disabled Prometheus scraping prior to doing this (as part of unrelated work) and reenabled it the next day.

After scraping was reenabled, the latency of requests sent via Linkerd increased by around 100-150% and some requests started failing. We saw errors in the Linkerd logs like this:

D 0525 14:42:21.804 UTC THREAD32 TraceId:5ff10920f019ca62: Failed mid-stream. Terminating stream, closing connection
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /10.224.140.9:80. Remote Info: Not Available
    at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$1.channelInactive(ChannelTransport.scala:188)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)

D 0525 14:17:43.098 UTC THREAD28 TraceId:611a81b0939bcb4b: Failed mid-stream. Terminating stream, closing connection
com.twitter.io.Reader$ReaderDiscarded: This writer's reader has been discarded
  at com.twitter.finagle.netty4.http.StreamTransports$$anon$1.discard(StreamTransports.scala:71)
  at com.twitter.finagle.http.DelayedReleaseService$$anon$2$$anon$3.discard(DelayedReleaseService.scala:58)
  at io.buoyant.router.Http$Router$.$anonfun$responseDiscarder$1(Http.scala:45)
  at io.buoyant.router.Http$Router$.$anonfun$responseDiscarder$1$adapted(Http.scala:43)
  at com.twitter.finagle.buoyant.RetryFilter.$anonfun$dispatch$3(RetryFilter.scala:71)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
  at com.twitter.util.Try$.apply(Try.scala:15)
  at com.twitter.finagle.buoyant.RetryFilter.$anonfun$dispatch$2(RetryFilter.scala:71)
  at com.twitter.finagle.buoyant.RetryFilter.$anonfun$dispatch$2$adapted(RetryFilter.scala:70)

The response time of the /admin/metrics/prometheus endpoint went from ~0.5 seconds to over 13 seconds on some Linkerd pods, whilst /admin/metrics.json and /admin/metrics/influxdb stayed around the same.

After restarting one of the most affected pods, the response time of the Prometheus endpoint dropped back down to ~0.5 seconds.
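
One straightforward way to confirm this kind of regression is to time the admin endpoints directly from inside the cluster. A minimal sketch of such a check, assuming the Linkerd 1.x admin interface listens on its default port 9990 (the pod IP below is hypothetical and should be replaced):

    # Minimal sketch: time each Linkerd admin metrics endpoint.
    # Assumes the admin interface is reachable on the default port 9990;
    # the pod IP is hypothetical.
    import time
    import urllib.request

    POD_IP = "10.0.0.1"        # hypothetical, replace with a real pod IP
    ADMIN_PORT = 9990          # assumed default Linkerd 1.x admin port

    def time_endpoint(path: str) -> float:
        """Fetch an admin endpoint and return the elapsed wall-clock seconds."""
        url = f"http://{POD_IP}:{ADMIN_PORT}{path}"
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=30) as resp:
            resp.read()
        return time.monotonic() - start

    for path in ("/admin/metrics/prometheus", "/admin/metrics.json", "/admin/metrics/influxdb"):
        print(f"{path}: {time_endpoint(path):.2f}s")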

What you expected to happen:

  • Response time on the Prometheus scrape endpoint to remain low after cycling lots of pods.
  • Service upstream latencies to be unaffected.

How to reproduce it (as minimally and precisely as possible):

We don’t have a minimal reproduction just yet; we wanted to flag this and see whether it’s a known issue before creating one.

Anything else we need to know?:

The number of unique series (metric name + combination of labels) on some of the pods with the highest scrape latency went from ~8k to ~35k. Digging into this a bit more: when we rotated pods on one of the nodes in our staging environment (draining the node and letting pods schedule on it again), the Linkerd metrics seemed to include “old” metrics for upstream services that had previously been proxied by Linkerd (albeit with all values 0). I wonder whether cycling all pods in the cluster caused the Linkerd pods to collectively process significantly more metrics (for old upstream services), which somehow caused the latency issues/errors?
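
To put a number on that series growth, one rough approach is to pull the Prometheus endpoint and count the distinct metric-name/label combinations, along with how many of them sit at zero (as the stale upstream series appeared to). A sketch under the same assumptions as above (hypothetical pod IP, assumed default admin port), and assuming the exposition lines carry no trailing timestamps:

    # Rough sketch: count unique series (metric name + labels) in the
    # Prometheus exposition output, plus how many of them are zero-valued.
    import urllib.request

    POD_IP = "10.0.0.1"        # hypothetical, replace with a real pod IP
    ADMIN_PORT = 9990          # assumed default Linkerd 1.x admin port

    url = f"http://{POD_IP}:{ADMIN_PORT}/admin/metrics/prometheus"
    with urllib.request.urlopen(url, timeout=60) as resp:
        body = resp.read().decode("utf-8")

    series = {}
    for line in body.splitlines():
        if not line or line.startswith("#"):   # skip HELP/TYPE/comment lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        series[name_and_labels] = value        # last occurrence wins

    zero_valued = sum(1 for v in series.values() if float(v) == 0.0)
    print(f"unique series: {len(series)}, zero-valued: {zero_valued}")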

There didn’t seem to be any significant impact on CPU or memory usage.

We’d be happy to (privately) share more logs and metrics if you can think of anything that would be helpful 👍

Environment:

  • linkerd/namerd version, config files: linkerd 1.3.4
  • Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes 1.7
  • Cloud provider or hardware configuration: AWS EC2 instances, mixture of c4.large and m4.2xlarge
  • K8s resource spec:
    Limits:
      cpu:                      1
      memory:                   800Mi
    Requests:
      cpu:                      300m
      memory:                   400Mi

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9 (7 by maintainers)

Top GitHub Comments

1 reaction
chrisgoffinet commented, Jun 20, 2018

@milesbxf trying to share the love, check out #2010. I know of a large user of Linkerd using Prom + many services, and we discovered that the /admin interface wasn’t set up with its own thread pool. Our tail latencies have been impacted since it was in the critical path of requests.
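
To illustrate the effect described here, below is a toy Python simulation (not Linkerd code) of what happens when a slow metrics scrape shares a small worker pool with ordinary request handling: the requests queue behind the scrape and their tail latency balloons, even though each one is cheap on its own.

    # Toy simulation only: a 2-second "scrape" sharing a 2-worker pool with
    # 5 ms "proxy requests". Latency is measured from submission to completion,
    # so time spent queued behind the scrape shows up in the tail.
    import concurrent.futures
    import time

    def proxy_request(submitted: float) -> float:
        time.sleep(0.005)                       # pretend proxying takes ~5 ms
        return time.monotonic() - submitted

    def slow_metrics_scrape() -> None:
        time.sleep(2.0)                         # pretend serialising ~35k series takes ~2 s

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        pool.submit(slow_metrics_scrape)        # occupies one of the two workers
        futures = [pool.submit(proxy_request, time.monotonic()) for _ in range(50)]
        latencies = sorted(f.result() for f in futures)

    print(f"median={latencies[25]*1000:.0f}ms  worst={latencies[-1]*1000:.0f}ms")

Giving the scrape its own worker pool would keep it out of the proxy requests’ queue entirely, so their latency would stay close to the underlying 5 ms.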

0 reactions
adleong commented, Jul 16, 2018

Closing due to inactivity. Please feel free to reopen if you’re still seeing issues.
