Increased service latency when collecting stats with a large number of services
Issue Type:
- Bug report
- Feature request
What happened: When collecting stats from the /admin/metrics.json or /admin/metrics/prometheus endpoint, we see added latency at the high percentiles (p999, max). This happens with a large number of services (I picked 800 as an example).
Using /admin/metrics/prometheus makes it even worse. We observe the added overhead through slow_cooker
and other clients.
What you expected to happen: Scraping the metrics endpoints should not increase p999 or max latency.
How to reproduce it (as minimally and precisely as possible):
- set up linkerd with the dtab/config below
- run slow_cooker with -interval 1s -qps 100 -concurrency 10 -host <a comma-separated list of 800 names> http://linkerd:8080, and give it about 2 minutes to set up all the stats
- watch the slow_cooker latency; you should see <10ms
- start curling /admin/metrics.json every second; the max will go up to about 70ms
- if you also start curling the /admin/metrics/prometheus endpoint, the slow_cooker latency goes even higher
I also confirmed that the request_latency.max reported in the stats is a bit higher than what slow_cooker shows,
but it is very easy to see the latency pushed up just by hitting those endpoints. If you stop curling altogether, the latency drops back below 10ms.
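As a minimal sketch of the steps above (the host names are arbitrary, since the dtab routes every /svc/* name to the same backend; the admin port 9990 matches the config below):

```sh
# Build 800 comma-separated host names (arbitrary; the dtab maps all /svc/* to one backend).
HOSTS=$(printf 'svc%d,' $(seq 1 800) | sed 's/,$//')

# Drive steady load through linkerd and watch the latency it reports.
slow_cooker -interval 1s -qps 100 -concurrency 10 -host "$HOSTS" http://linkerd:8080

# In a second shell: scrape the admin metrics endpoints once per second.
while true; do
  curl -s -o /dev/null http://linkerd:9990/admin/metrics.json
  # or: curl -s -o /dev/null http://linkerd:9990/admin/metrics/prometheus
  sleep 1
done
```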
Anything else we need to know?:
Environment:
- linkerd/namerd version, config files:
admin:
  port: 9990
  ip: 0.0.0.0

telemetry:
- kind: io.l5d.prometheus

routers:
- protocol: http
  client:
    failureAccrual:
      kind: none
  dtab: |
    /svc/* => /$/inet/172.18.0.3/80;
  servers:
  - port: 8080
    ip: 0.0.0.0
- Platform, version, and config files (Kubernetes, DC/OS, etc):
- Cloud provider or hardware configuration:
Top GitHub Comments
Related to the previous comment, it does appear that linkerd is a lot happier with a constant committed heap.
The dramatic drop in full GCs (indicated by red triangles in the GC graph) was clearly visible.
We’ll measure and report back on whether GC behavior changes further with the new changes (the efficiency and thread pool PRs).
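For reference, "constant committed heap" here just means pinning the JVM's minimum and maximum heap to the same value, so the committed heap never grows or shrinks under GC pressure. The 1024m figure and the way these flags reach linkerd's JVM are assumptions that depend on your deployment:

```sh
# Illustrative only: making -Xms equal to -Xmx keeps the committed heap constant,
# so the JVM stops resizing the heap under GC pressure. The size shown and the
# mechanism for passing the flags to linkerd's JVM are deployment-specific assumptions.
JVM_HEAP_FLAGS="-Xms1024m -Xmx1024m"
```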
GC performance seems much improved on a node we’re running with Alex’s two PRs. In the three hours we’ve been running it, we’ve only seen one full GC 😮.