Namerd memory leak
Creating this issue as a follow-up to the discussions with @dadjeibaah in the Slack channel.
Issue Type:
- Bug report
What happened: Namerd’s memory usage keeps increasing, and garbage collection does not kick in until usage reaches 80%–100%. A possible consequence is that, if the GC cannot complete properly, the n4d pods become unable to serve any requests until they are restarted. A related observation is that Linkerd’s memory usage has never gone above 20%, which suggests that Namerd is not releasing objects/resources properly, i.e. a memory leak.
Namerd memory usage over time (chart)
Linkerd memory usage over time (chart)
What you expected to happen: Namerd’s memory usage should follow a pattern similar to Linkerd’s, as shown in the charts above.
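Note: heap growth like the one described above can also be sampled directly inside the JVM, independent of the Prometheus telemeter. Below is a minimal, hypothetical Scala sketch (not part of namerd; the object name is illustrative) that polls the standard MemoryMXBean once a minute. Run against an n4d and an l5d pod, it would show the ever-growing curve versus the stable one referred to in the charts above.

import java.lang.management.ManagementFactory

// Hypothetical monitoring sketch (not namerd code): polls the JVM's own view of
// heap usage once a minute, making the "grows until ~80-100%" pattern visible
// without any external telemetry.
object HeapSampler {
  def main(args: Array[String]): Unit = {
    val memory = ManagementFactory.getMemoryMXBean
    while (true) {
      val heap = memory.getHeapMemoryUsage          // used/committed/max, in bytes
      val pct  = 100.0 * heap.getUsed / heap.getMax // assumes -Xmx is set, so max > 0
      println(f"heap used: ${heap.getUsed} bytes (${pct}%.1f%% of max)")
      Thread.sleep(60 * 1000L)
    }
  }
}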
How to reproduce it (as minimally and precisely as possible): N/A
Anything else we need to know?:
Environment:
- linkerd/namerd version, config files: Namerd 1.6.0
Namerd config file (a ZooKeeper sketch illustrating the io.l5d.zk storage follows this Environment section):
admin:
  ip: 0.0.0.0
telemetry:
- kind: io.l5d.prometheus
  prefix: l5d_n4d_
storage:
  kind: io.l5d.zk
  pathPrefix: /dtabs
  zkAddrs:
  - host: hostname1
    port: 2181
  - host: hostname2
    port: 2181
  - host: hostname3
    port: 2181
namers:
- kind: io.l5d.k8s
  prefix: -
  host: 127.0.0.1
  port: 8001
  transformers:
  - kind: io.l5d.k8s.daemonset
    namespace: mesh
    k8sHost: 127.0.0.1
    k8sPort: 8001
    port: in-http
    service: l5d
- kind: io.l5d.k8s
  prefix: -
  host: 127.0.0.1
  port: 8001
  transformers:
  - kind: io.l5d.k8s.daemonset
    namespace: mesh
    k8sHost: 127.0.0.1
    k8sPort: 8001
    port: in-grpc
    service: l5d
- kind: io.l5d.k8s
  prefix: -
  host: 127.0.0.1
  port: 8001
interfaces:
- kind: io.l5d.mesh
  ip: 0.0.0.0
  port: 4321
- kind: io.l5d.httpController
  ip: 0.0.0.0
  port: 4180
- Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes
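Editorial note on the storage block above: as I understand it, the io.l5d.zk backend keeps each dtab as a znode under the configured pathPrefix (/dtabs here), and namerd observes that subtree for changes. The following standalone Scala sketch is hypothetical and is not namerd’s implementation; it uses the plain ZooKeeper client to list the dtab znodes on the same ensemble and registers a watch, which is the kind of long-lived observation the leak suspect discussed in the comments points at.

import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
import scala.jdk.CollectionConverters._ // Scala 2.13+

// Hypothetical illustration (not namerd code): connect to the same ZooKeeper
// ensemble as the config above, list the dtab znodes under /dtabs, and leave a
// watch registered for changes.
object DtabZkPeek {
  def main(args: Array[String]): Unit = {
    val connect = "hostname1:2181,hostname2:2181,hostname3:2181"
    val watcher = new Watcher {
      def process(event: WatchedEvent): Unit =
        println(s"zk event: ${event.getType} on ${event.getPath}")
    }
    val zk = new ZooKeeper(connect, 30000, watcher)
    try {
      // 'true' registers the default watcher for changes under /dtabs.
      val dtabs = zk.getChildren("/dtabs", true).asScala
      println(s"dtabs stored in ZooKeeper: ${dtabs.mkString(", ")}")
    } finally zk.close()
  }
}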
Top GitHub Comments
@adw12382 thanks! This heap report is hugely helpful. We’ve got some theories about what might be happening that we’re trying to validate. I’ll keep this issue updated with our findings.
I investigated a bit and was able to generate a heap dump from the live n4d pod in our experimental environment using jmap. Here is the command I used:

jmap -dump:format=b,file=namerdump.hprof {Java PID}

Thanks @dadjeibaah for sharing the command. The Leak Suspects section of the report shows the following:

It seems related to the connections to ZooKeeper, where we store our dtabs. In addition, according to the logs produced by n4d, every 15 minutes we receive around 2,000 log lines with the message "Attempting to observe dtab/*". I checked the source code; it seems to check whether the dtab exists or is valid, but since I am not familiar with Scala I do not know the details. The attachments are reports from the Eclipse Memory Analyzer; let me know if there are any other details I can provide.
dominatorTreeReport.zip ThreadDetailsReport.zip namerdumpLeakHunterReport.zip
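Side note (editorial, not from the thread): if jmap is not shipped in the container image, an equivalent heap dump can be triggered from inside the JVM via the HotSpotDiagnostic MBean. A minimal Scala sketch, with an illustrative object name and output path:

import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

// Writes an .hprof heap dump equivalent to
// `jmap -dump:live,format=b,file=<path> <pid>`, but from inside the process.
object InProcessHeapDump {
  def dump(path: String = "/tmp/namerdump.hprof", liveOnly: Boolean = true): Unit = {
    val bean = ManagementFactory.newPlatformMXBeanProxy(
      ManagementFactory.getPlatformMBeanServer,
      "com.sun.management:type=HotSpotDiagnostic",
      classOf[HotSpotDiagnosticMXBean])
    // liveOnly = true dumps only reachable objects (a collection runs first);
    // the target file must not already exist.
    bean.dumpHeap(path, liveOnly)
  }
}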