Increased service latency when collecting stats with a large number of services
Issue Type:
- Bug report
- Feature request
What happened: When collecting stats from the /admin/metrics.json or /admin/metrics/prometheus endpoint, we see added latency at the high percentiles (p999, max). This happens with a large number of services (I picked 800 as an example).
Using /admin/metrics/prometheus makes it even worse. We observe the added overhead through slow_cooker
and other clients.
What you expected to happen: Scraping the metrics endpoints should not increase p999 or max latency.
How to reproduce it (as minimally and precisely as possible):
- set up linkerd with the dtab/config below
- run slow_cooker with -interval 1s -qps 100 -concurrency 10 -host <a comma-separated list of 800 names> http://linkerd:8080, and give it about 2 minutes to set up all the stats
- watch the slow_cooker latency; you should see <10ms
- start curling /admin/metrics.json every second; the max will go up to about 70ms
- if you also start curling the /admin/metrics/prometheus endpoint, the slow_cooker latency goes even higher
I also confirmed that the request_latency.max reported in the stats is a bit higher than what slow_cooker shows,
but it is very easy to see the latency pushed up just by hitting those endpoints. If you stop curling altogether, the latency drops back below 10ms.
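As a minimal sketch of the steps above (the host names are arbitrary, since the dtab routes every /svc/* name to the same backend; the admin port 9990 matches the config below):

```sh
# Build 800 comma-separated host names (arbitrary; the dtab maps all /svc/* to one backend).
HOSTS=$(printf 'svc%d,' $(seq 1 800) | sed 's/,$//')

# Drive steady load through linkerd and watch the latency it reports.
slow_cooker -interval 1s -qps 100 -concurrency 10 -host "$HOSTS" http://linkerd:8080

# In a second shell: scrape the admin metrics endpoints once per second.
while true; do
  curl -s -o /dev/null http://linkerd:9990/admin/metrics.json
  # or: curl -s -o /dev/null http://linkerd:9990/admin/metrics/prometheus
  sleep 1
done
```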
Anything else we need to know?:
Environment:
- linkerd/namerd version, config files:
admin:
  port: 9990
  ip: 0.0.0.0

telemetry:
- kind: io.l5d.prometheus

routers:
- protocol: http
  client:
    failureAccrual:
      kind: none
  dtab: |
    /svc/* => /$/inet/172.18.0.3/80;
  servers:
  - port: 8080
    ip: 0.0.0.0
- Platform, version, and config files (Kubernetes, DC/OS, etc):
- Cloud provider or hardware configuration:
Top GitHub Comments
Related to the previous comment, it does appear that linkerd is a lot happier with a constant committed heap.
The dramatic drop in full GCs (indicated by red triangles in the GC graph) was clearly visible.
We’ll measure and report back on whether GC behavior changes further with the new changes (the efficiency and thread pool PRs).
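For reference, "constant committed heap" here just means pinning the JVM's minimum and maximum heap to the same value, so the committed heap never grows or shrinks under GC pressure. The 1024m figure and the way these flags reach linkerd's JVM are assumptions that depend on your deployment:

```sh
# Illustrative only: making -Xms equal to -Xmx keeps the committed heap constant,
# so the JVM stops resizing the heap under GC pressure. The size shown and the
# mechanism for passing the flags to linkerd's JVM are deployment-specific assumptions.
JVM_HEAP_FLAGS="-Xms1024m -Xmx1024m"
```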
GC performance seems much improved on a node we’re running with Alex’s two PRs. In the three hours we’ve been running it, we’ve only seen one full GC 😮.