Namerd memory leak
Creating this issue as a follow-up to the discussions with @dadjeibaah in the Slack channel.
Issue Type:
- Bug report
What happened: Namerd’s memory usage keeps increasing, and garbage collection does not kick in until usage reaches 80%–100%. A possible consequence is that, if the GC cannot complete properly, the n4d pods become unable to serve any requests until they are restarted. A related observation is that Linkerd’s memory usage has never gone above 20%, which suggests that Namerd is not releasing objects/resources properly, i.e. a memory leak.
Namerd memory usage over time (chart)
Linkerd memory usage over time (chart)
What you expected to happen: Namerd’s memory usage should follow a pattern similar to Linkerd’s, as shown in the charts above.
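Note: heap growth like the one described above can also be sampled directly inside the JVM, independent of the Prometheus telemeter. Below is a minimal, hypothetical Scala sketch (not part of namerd; the object name is illustrative) that polls the standard MemoryMXBean once a minute. Run against an n4d and an l5d pod, it would show the ever-growing curve versus the stable one referred to in the charts above.

import java.lang.management.ManagementFactory

// Hypothetical monitoring sketch (not namerd code): polls the JVM's own view of
// heap usage once a minute, making the "grows until ~80-100%" pattern visible
// without any external telemetry.
object HeapSampler {
  def main(args: Array[String]): Unit = {
    val memory = ManagementFactory.getMemoryMXBean
    while (true) {
      val heap = memory.getHeapMemoryUsage          // used/committed/max, in bytes
      val pct  = 100.0 * heap.getUsed / heap.getMax // assumes -Xmx is set, so max > 0
      println(f"heap used: ${heap.getUsed} bytes (${pct}%.1f%% of max)")
      Thread.sleep(60 * 1000L)
    }
  }
}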
How to reproduce it (as minimally and precisely as possible): N/A
Anything else we need to know?:
Environment:
- linkerd/namerd version, config files: Namerd 1.6.0
Namerd config file (a ZooKeeper sketch illustrating the io.l5d.zk storage follows this Environment section):
admin:
  ip: 0.0.0.0
telemetry:
- kind: io.l5d.prometheus
  prefix: l5d_n4d_
storage:
  kind: io.l5d.zk
  pathPrefix: /dtabs
  zkAddrs:
  - host: hostname1
    port: 2181
  - host: hostname2
    port: 2181
  - host: hostname3
    port: 2181
namers:
- kind: io.l5d.k8s
  prefix: -
  host: 127.0.0.1
  port: 8001
  transformers:
  - kind: io.l5d.k8s.daemonset
    namespace: mesh
    k8sHost: 127.0.0.1
    k8sPort: 8001
    port: in-http
    service: l5d
- kind: io.l5d.k8s
  prefix: -
  host: 127.0.0.1
  port: 8001
  transformers:
  - kind: io.l5d.k8s.daemonset
    namespace: mesh
    k8sHost: 127.0.0.1
    k8sPort: 8001
    port: in-grpc
    service: l5d
- kind: io.l5d.k8s
  prefix: -
  host: 127.0.0.1
  port: 8001
interfaces:
- kind: io.l5d.mesh
  ip: 0.0.0.0
  port: 4321
- kind: io.l5d.httpController
  ip: 0.0.0.0
  port: 4180
- Platform, version, and config files (Kubernetes, DC/OS, etc): Kubernetes
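Editorial note on the storage block above: as I understand it, the io.l5d.zk backend keeps each dtab as a znode under the configured pathPrefix (/dtabs here), and namerd observes that subtree for changes. The following standalone Scala sketch is hypothetical and is not namerd’s implementation; it uses the plain ZooKeeper client to list the dtab znodes on the same ensemble and registers a watch, which is the kind of long-lived observation the leak suspect discussed in the comments points at.

import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
import scala.jdk.CollectionConverters._ // Scala 2.13+

// Hypothetical illustration (not namerd code): connect to the same ZooKeeper
// ensemble as the config above, list the dtab znodes under /dtabs, and leave a
// watch registered for changes.
object DtabZkPeek {
  def main(args: Array[String]): Unit = {
    val connect = "hostname1:2181,hostname2:2181,hostname3:2181"
    val watcher = new Watcher {
      def process(event: WatchedEvent): Unit =
        println(s"zk event: ${event.getType} on ${event.getPath}")
    }
    val zk = new ZooKeeper(connect, 30000, watcher)
    try {
      // 'true' registers the default watcher for changes under /dtabs.
      val dtabs = zk.getChildren("/dtabs", true).asScala
      println(s"dtabs stored in ZooKeeper: ${dtabs.mkString(", ")}")
    } finally zk.close()
  }
}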
Top GitHub Comments
@adw12382 thanks! This heap report is hugely helpful. We’ve got some theories about what might be happening that we’re trying to validate. I’ll keep this issue updated with our findings.
I investigated a bit and was able to generate a heap dump from the live n4d pod in our experimental environment using jmap. Here is the command I used:

jmap -dump:format=b,file=namerdump.hprof {Java PID}

Thanks @dadjeibaah for sharing the command. The Leak Suspects section of the report shows the following:

It seems related to the connections to ZooKeeper, where we store our dtabs. In addition, according to the logs produced by n4d, every 15 minutes we receive around 2,000 log lines with the message "Attempting to observe dtab/*". I checked the source code; it seems to check whether the dtab exists or is valid, but since I am not familiar with Scala I do not know the details. The attachments are reports from the Eclipse Memory Analyzer; let me know if there are any other details I can provide.
dominatorTreeReport.zip ThreadDetailsReport.zip namerdumpLeakHunterReport.zip
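Side note (editorial, not from the thread): if jmap is not shipped in the container image, an equivalent heap dump can be triggered from inside the JVM via the HotSpotDiagnostic MBean. A minimal Scala sketch, with an illustrative object name and output path:

import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

// Writes an .hprof heap dump equivalent to
// `jmap -dump:live,format=b,file=<path> <pid>`, but from inside the process.
object InProcessHeapDump {
  def dump(path: String = "/tmp/namerdump.hprof", liveOnly: Boolean = true): Unit = {
    val bean = ManagementFactory.newPlatformMXBeanProxy(
      ManagementFactory.getPlatformMBeanServer,
      "com.sun.management:type=HotSpotDiagnostic",
      classOf[HotSpotDiagnosticMXBean])
    // liveOnly = true dumps only reachable objects (a collection runs first);
    // the target file must not already exist.
    bean.dumpHeap(path, liveOnly)
  }
}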