
Increased service latency when collecting stats with large number of services

See original GitHub issue

Issue Type:

  • Bug report
  • Feature request

What happened: When collecting stats from /admin/metrics.json or /admin/metrics/prometheus, we see added latency at the high percentiles (p999, max). This happens with a large number of services (I picked 800 as an example).

We noticed that using /admin/metrics/prometheus makes it even worse. We observe the added overhead through slow_cooker and other clients.

What you expected to happen: Collecting stats should not increase p999/max latency.

How to reproduce it (as minimally and precisely as possible):

  1. set up linkerd with the dtab below
  2. use slow_cooker with -interval 1s -qps 100 -concurrency 10 -host <generate a list of 800 names comma separated> http://linkerd:8080
  3. give it about 2 minutes to set up all stats
  4. watch the slow_cooker latency; you should see <10ms
  5. start curling /admin/metrics.json every second; the max will go to about 70ms
  6. if you start curling the /admin/metrics/prometheus endpoint, the slow_cooker latency will go even higher

I also confirmed that request_latency.max in the stats runs a bit higher than what slow_cooker reports, but it's very easy to see latency jump just by hitting those endpoints. If you stop curling altogether, the latency drops back below 10ms.
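The reproduction steps above can be sketched in shell. The host-list generation is an assumption on my part (any 800 distinct names should work, since the catch-all dtab rule routes every /svc/* name to the same backend); the slow_cooker and curl invocations follow the steps as written:

```shell
# Build a comma-separated list of 800 distinct service names (names are
# arbitrary; the catch-all dtab rule routes them all to the same backend).
HOSTS=$(seq 1 800 | sed 's/^/svc-/' | paste -sd, -)

# Step 2: drive load through linkerd (runs until interrupted):
#   slow_cooker -interval 1s -qps 100 -concurrency 10 -host "$HOSTS" http://linkerd:8080

# Step 5: in another shell, poll the admin endpoint once per second and
# watch slow_cooker's reported max latency climb:
#   while true; do curl -s http://linkerd:9990/admin/metrics.json >/dev/null; sleep 1; done

echo "$HOSTS" | tr ',' '\n' | wc -l   # count of generated names
```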

Anything else we need to know?:

Environment:

  • linkerd/namerd version, config files:

```yaml
admin:
  port: 9990
  ip: 0.0.0.0

telemetry:
- kind: io.l5d.prometheus

routers:
- protocol: http
  client:
    failureAccrual:
      kind: none
  dtab: |
    /svc/* => /$/inet/172.18.0.3/80;
  servers:
  - port: 8080
    ip: 0.0.0.0
```
  • Platform, version, and config files (Kubernetes, DC/OS, etc):
  • Cloud provider or hardware configuration:

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

2 reactions
zackangelo commented, Jun 21, 2018

Related to the previous comment, it does appear that linkerd is a lot happier with a constant committed heap.

The dramatic drop in full GCs (indicated by red triangles) is visible in the attached GC chart.

We’ll measure and report back if the GC behavior changes further with the new changes (efficiency and thread pool PRs).
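For context on "constant committed heap": this is commonly achieved by pinning the JVM's initial and maximum heap to the same value so the heap is never resized at runtime. The snippet below is a generic sketch; the sizes and the pre-touch flag are illustrative assumptions, not taken from this issue:

```shell
# Pin the committed heap: with -Xms equal to -Xmx, the JVM commits the full
# heap at startup and never grows or shrinks it, which avoids resize-driven
# GC churn. -XX:+AlwaysPreTouch additionally faults in every page up front.
HEAP_FLAGS="-Xms1g -Xmx1g -XX:+AlwaysPreTouch"

# These would be passed to the linkerd JVM via whatever mechanism the
# deployment uses for JVM options.
echo "$HEAP_FLAGS"
```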

1 reaction
zackangelo commented, Jun 21, 2018

GC performance seems much improved on a node we're running with Alex's two PRs. In the three hours we've been running it, we've only seen one full GC 😮 (screenshot attached).

