[0.9.1] Failure Accrual and FD Exhaustion
I've been seeing a slow FD leak that correlates very closely with failure accrual. The metric that increases forever is `jvm/fd_count`; in my case we are exhausting 16k FDs. A few instances of failure accrual seem to be consuming 496 FDs:
```
"jvm/fd_count" : 496.0,
"jvm/fd_limit" : 16384.0,
"rt/http/dst/id/#/io.l5d.serversets/sd/nobody/prod/packager-proxy/failure_accrual/probes" : 28,
"rt/http/dst/id/#/io.l5d.serversets/sd/nobody/prod/packager-proxy/failure_accrual/removals" : 2,
"rt/http/dst/id/#/io.l5d.serversets/sd/nobody/prod/packager-proxy/failure_accrual/removed_for_ms" : 2827504,
"rt/http/dst/id/#/io.l5d.serversets/sd/nobody/prod/packager-proxy/failure_accrual/revivals" : 2,
```
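For context, on Linux a process's open-FD count is simply the number of entries under `/proc/<pid>/fd`, which is what `jvm/fd_count` reflects. A minimal Python sketch (Linux-only; the deliberately leaked handles are purely illustrative) for confirming a climbing FD count independently of the JVM metric:

```python
import os

def fd_count(pid=None):
    """Count open file descriptors by listing /proc/<pid>/fd.

    This mirrors what jvm/fd_count reports on Linux; it requires
    procfs, so it only works on Linux-like systems.
    """
    proc = "self" if pid is None else str(pid)
    return len(os.listdir(f"/proc/{proc}/fd"))

# Opening handles without closing them makes the count climb --
# the same signature as the leak described above.
before = fd_count()
leaked = [open("/dev/null") for _ in range(10)]  # deliberately not closed
after = fd_count()
print(after - before)  # grows by the number of leaked handles
for f in leaked:
    f.close()
```

Polling `fd_count(<linkerd pid>)` alongside the `failure_accrual` stats is a quick way to verify the correlation without restarting the process.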
How to reproduce:
- downstream returns 5xx
- retry budget exhausts
- instance is marked failed
Here is my linkerd config:
```yaml
---
admin:
  port: 9990
namers:
- kind: io.l5d.serversets
  zkAddrs:
  - host: mesos-master01of2.thebrighttag.com
    port: 2181
  - host: mesos-master02of2.thebrighttag.com
    port: 2181
  - host: mesos-master03of2.thebrighttag.com
    port: 2181
telemetry:
- kind: io.l5d.zipkin
  host: mesos-master01of2.thebrighttag.com
  port: 9410
  sampleRate: 0.001
- kind: io.l5d.prometheus
routers:
- protocol: http
  label: http
  dstPrefix: /host
  maxChunkKB: 16
  maxHeadersKB: 16
  maxInitialLineKB: 16
  maxRequestKB: 102400 # 100MB
  maxResponseKB: 102400 # 100MB
  compressionLevel: 9
  interpreter:
    kind: io.l5d.namerd
    dst: /#/io.l5d.serversets/sd/nobody/prod/namerd:thrift
    namespace: default
  responseClassifier:
    kind: io.l5d.retryableIdempotent5XX
  client:
    retries:
      budget:
        minRetriesPerSec: 5
        percentCanRetry: 0.5
        ttlSecs: 15
      backoff:
        kind: jittered
        minMs: 10
        maxMs: 10000
  servers:
  - port: 80
    label: prod/linkerd
    ip: 10.150.150.227
    announce:
    - /#/io.l5d.serversets/prod/linkerd
  - port: 80
    ip: 127.0.0.1
announcers:
- kind: io.l5d.serversets
  pathPrefix: /sd/nobody
  zkAddrs:
  - host: mesos-master01of2.thebrighttag.com
    port: 2181
  - host: mesos-master02of2.thebrighttag.com
    port: 2181
  - host: mesos-master03of2.thebrighttag.com
    port: 2181
```
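The `kind: jittered` backoff in the config spreads retry timing randomly between `minMs` and `maxMs` so retries from many clients don't synchronize. A sketch of one common strategy in that family, AWS-style decorrelated jitter (the exact formula linkerd uses is not confirmed here; this is illustrative only):

```python
import random

def jittered_backoffs(min_ms=10, max_ms=10000, n=10, seed=0):
    """Illustrative decorrelated-jitter backoff bounded by [minMs, maxMs].

    The formula (draw between the floor and 3x the previous delay,
    capped at max_ms) is an assumption for illustration, not a
    transcription of linkerd's implementation.
    """
    rng = random.Random(seed)
    delays, prev = [], min_ms
    for _ in range(n):
        # Randomizing each delay prevents a thundering herd of
        # synchronized retries against a recovering backend.
        prev = min(max_ms, rng.uniform(min_ms, prev * 3))
        delays.append(prev)
    return delays

delays = jittered_backoffs()
print(len(delays), min(delays) >= 10, max(delays) <= 10000)  # 10 True True
```

Every delay stays within the configured `minMs`/`maxMs` bounds while growing roughly exponentially between retries.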
Issue Analytics
- State:
- Created: 6 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
@adleong seems to fix the issue for me.
Awesome! The above PR has been merged. Please reopen this issue if the problem reappears.