Cycling Kubernetes pods and scraping Prometheus metrics causing latency issues
Issue Type:
- Bug report
- Feature request
What happened:
We deploy Linkerd as a daemonset (to around 500 nodes) on Kubernetes, and we’re scraping metrics using Prometheus.
We cycled every pod in the cluster (around 3400 in total) whilst doing some routine maintenance. We disabled Prometheus scraping prior to doing this (as part of unrelated work) and reenabled it the next day.
After scraping was reenabled, the latency of requests sent via Linkerd increased by around 100-150% and some requests started failing. We saw errors in the Linkerd logs like this:
D 0525 14:42:21.804 UTC THREAD32 TraceId:5ff10920f019ca62: Failed mid-stream. Terminating stream, closing connection
com.twitter.finagle.ChannelClosedException: ChannelException at remote address: /10.224.140.9:80. Remote Info: Not Available
at com.twitter.finagle.netty4.transport.ChannelTransport$$anon$1.channelInactive(ChannelTransport.scala:188)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
D 0525 14:17:43.098 UTC THREAD28 TraceId:611a81b0939bcb4b: Failed mid-stream. Terminating stream, closing connection
com.twitter.io.Reader$ReaderDiscarded: This writer's reader has been discarded
at com.twitter.finagle.netty4.http.StreamTransports$$anon$1.discard(StreamTransports.scala:71)
at com.twitter.finagle.http.DelayedReleaseService$$anon$2$$anon$3.discard(DelayedReleaseService.scala:58)
at io.buoyant.router.Http$Router$.$anonfun$responseDiscarder$1(Http.scala:45)
at io.buoyant.router.Http$Router$.$anonfun$responseDiscarder$1$adapted(Http.scala:43)
at com.twitter.finagle.buoyant.RetryFilter.$anonfun$dispatch$3(RetryFilter.scala:71)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at com.twitter.util.Try$.apply(Try.scala:15)
at com.twitter.finagle.buoyant.RetryFilter.$anonfun$dispatch$2(RetryFilter.scala:71)
at com.twitter.finagle.buoyant.RetryFilter.$anonfun$dispatch$2$adapted(RetryFilter.scala:70)
The response time from the /admin/metrics/prometheus endpoint went from ~0.5 seconds to over 13 seconds on some Linkerd pods, whilst /admin/metrics.json and /admin/metrics/influxdb stayed around the same.
After restarting one of the most affected pods, the response time of the Prometheus endpoint dropped back down to ~0.5 seconds.
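For anyone wanting to compare the endpoints themselves, below is a minimal sketch of how one might time the three admin endpoints on a single pod. The localhost address (e.g. via `kubectl port-forward <linkerd-pod> 9990:9990`) and the admin port 9990 (the Linkerd 1.x default) are assumptions for illustration, not part of the original report.
```scala
import scala.io.Source

// Rough sketch: time the three admin metrics endpoints on one Linkerd pod.
// Assumes the admin port has been forwarded to localhost:9990.
object TimeMetricsEndpoints extends App {
  val base = "http://localhost:9990" // assumed admin address

  Seq("/admin/metrics/prometheus", "/admin/metrics.json", "/admin/metrics/influxdb").foreach { path =>
    val start     = System.nanoTime()
    val body      = Source.fromURL(base + path).mkString
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$path%-30s ${body.length}%9d bytes in $elapsedMs%.1f ms")
  }
}
```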
What you expected to happen:
- Response time on the Prometheus scrape endpoint to remain low after cycling lots of pods.
- Service upstream latencies to be unaffected.
How to reproduce it (as minimally and precisely as possible):
We don’t have a minimal reproduction just yet - just wanted to flag this and see if this was a known issue before creating one.
Anything else we need to know?:
The number of unique series (metric + label combination) on some of the pods with the highest scrape latency went from ~8k to ~35k. Digging into this a bit more: when we rotated the pods on one of the nodes in our staging environment (draining the node and letting pods schedule on it again), the Linkerd metrics still seemed to include “old” metrics for upstream services that had previously been proxied by Linkerd (albeit with all values 0). I wonder if cycling all pods in the cluster caused the Linkerd pods to collectively process significantly more metrics (for old upstream services), which somehow caused the latency issues/errors?
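As a rough way to quantify that observation, something like the sketch below counts the unique series in the Prometheus text exposition and how many currently report 0. The address is again an assumption (port-forwarded admin port), and the parsing is deliberately naive: it assumes no spaces inside label values and no trailing timestamps.
```scala
import scala.io.Source

// Rough sketch: count unique series and zero-valued samples in the text
// exposition, to spot "old" series for upstreams that are no longer proxied.
object CountPrometheusSeries extends App {
  val url = "http://localhost:9990/admin/metrics/prometheus" // assumed address

  val samples = Source.fromURL(url).getLines().toSeq
    .filter(l => l.nonEmpty && !l.startsWith("#")) // drop HELP/TYPE comment lines

  // Each sample line looks like: metric_name{label="value",...} <value>
  val seriesNames = samples.map(_.takeWhile(_ != ' '))
  val zeroValued  = samples.count(_.split(' ').lastOption.exists(v => v == "0" || v == "0.0"))

  println(s"unique series:       ${seriesNames.distinct.size}")
  println(s"series with value 0: $zeroValued")
}
```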
There didn’t seem to be any significant impact on CPU or memory usage.
We’d be happy to (privately) share more logs and metrics if you can think of anything that would be helpful 👍
Environment:
- linkerd/namerd version, config files: linkerd 1.3.4
- Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes 1.7
- Cloud provider or hardware configuration: AWS EC2 instances, mixture of c4.large and m4.2xlarge
K8s resource spec:
Limits:
  cpu: 1
  memory: 800Mi
Requests:
  cpu: 300m
  memory: 400Mi
Top GitHub Comments
@milesbxf trying to share the love, check out #2010. I know of a large user of Linkerd using Prom + many services, and we discovered that the /admin interface wasn't set up with its own thread pool. Our tail latencies have been impacted since it was in the critical path of requests.
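For context on what "its own thread pool" could look like in Finagle terms, here is a hedged, standalone sketch; it is not Linkerd's actual implementation, and renderPrometheusMetrics is a hypothetical stand-in. The idea is simply that potentially slow metrics rendering runs on a dedicated FuturePool so it cannot tie up the worker threads sitting on the request path.
```scala
import java.util.concurrent.Executors

import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response}
import com.twitter.util.{Await, FuturePool}

// Illustrative only: serve a slow metrics endpoint from a dedicated thread pool
// so that rendering a large exposition cannot stall request-serving threads.
object IsolatedMetricsServer extends App {
  // Dedicated pool for (potentially slow) metrics serialization.
  val metricsPool = FuturePool(Executors.newFixedThreadPool(2))

  // Hypothetical stand-in for whatever produces the Prometheus text exposition.
  def renderPrometheusMetrics(): String = "# metrics text here\n"

  val metricsService = Service.mk[Request, Response] { _ =>
    metricsPool {
      val rep = Response()
      rep.contentString = renderPrometheusMetrics()
      rep
    }
  }

  Await.ready(Http.serve(":9990", metricsService))
}
```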
Closing due to inactivity. Please feel free to reopen if you’re still seeing issues.