
Linkerd pods in strange state - possible memory leak

Issue Type: Bug report

What happened:

We use linkerd on the server side as a load balancer for our gRPC application running on Kubernetes; the clients don’t go through linkerd. Once the clients started using our application, the memory of our linkerd pods started increasing one by one. None of the pods were killed. Instead, each pod hit a threshold at which it stopped receiving requests, and once all pods reached this state the clients couldn’t reach the API anymore. We tried several version changes, including linkerd, gRPC and Kubernetes, but the issue persisted. Keep in mind that our other server applications use the same configuration and didn’t seem to have this issue. Our application returns a gRPC status exception (on purpose) for almost half of the requests; for testing, we removed this exception and the issue stopped happening.

How to reproduce it (as minimally and precisely as possible):

With a simple gRPC application, send requests to it and return gRPC status exceptions in the response observer: responseObserver.onError(new StatusRuntimeException(Status.NOT_FOUND))
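
For reference, here is a minimal server sketch of that reproduction. It is only a sketch: it assumes the stock helloworld.proto stubs that ship with the grpc-java examples (GreeterGrpc, HelloRequest, HelloReply), and the port and the roughly 50% failure ratio are illustrative.

import io.grpc.Server;
import io.grpc.ServerBuilder;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import io.grpc.stub.StreamObserver;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import java.util.concurrent.atomic.AtomicLong;

public class NotFoundRepro {

  // Fails roughly every second call with a gRPC status exception, on purpose,
  // mirroring the behaviour described above.
  static class Greeter extends GreeterGrpc.GreeterImplBase {
    private final AtomicLong counter = new AtomicLong();

    @Override
    public void sayHello(HelloRequest req, StreamObserver<HelloReply> responseObserver) {
      if (counter.incrementAndGet() % 2 == 0) {
        responseObserver.onError(new StatusRuntimeException(Status.NOT_FOUND));
        return;
      }
      responseObserver.onNext(HelloReply.newBuilder()
          .setMessage("Hello " + req.getName())
          .build());
      responseObserver.onCompleted();
    }
  }

  public static void main(String[] args) throws Exception {
    Server server = ServerBuilder.forPort(50051) // port is arbitrary for the repro
        .addService(new Greeter())
        .build()
        .start();
    server.awaitTermination();
  }
}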

Environment:

  • Linkerd version: 1.1.2 (also happened with 1.2.0 and 1.3.0)
  • Kubernetes version: 1.7.0 (also happened with 1.7.8)
  • gRPC version: 1.5.0 (also happened with 1.3.0)

Linkerd configuration (ConfigMap, DaemonSet, and Service manifests):

apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
  namespace: (...)
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9990
    namers:
    - kind: io.l5d.k8s
      experimental: true
      host: localhost
      port: 8001
    telemetry:
    - kind: io.l5d.prometheus
    - kind: io.l5d.recentRequests
      sampleRate: 0.25
    usage:
      orgId: linkerd-daemonset-grpc
    routers:
    - protocol: h2
      label: outgoing
      experimental: true
      dtab: |
        (...)
      identifier:
        kind: io.l5d.header.path
        segments: 1
      servers:
      - port: 4140
        ip: 0.0.0.0

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d
  name: l5d
  namespace: (...)
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: l5d
      annotations:
        linkerd.io/scrape: 'true'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.1.2
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140
        - name: admin
          containerPort: 9990
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true
      - name: kubectl
        image: buoyantio/kubectl:v1.6.2
        args:
        - "proxy"
        - "-p"
        - "8001"

apiVersion: v1
kind: Service
metadata:
  annotations:
    "service.beta.kubernetes.io/aws-load-balancer-internal": 0.0.0.0/0
  name: l5d
  namespace: (...)
spec:
  selector:
    app: l5d
  type: LoadBalancer
  ports:
  - name: outgoing
    port: 4140
  - name: incoming
    port: 4141
  - name: admin
    port: 9990
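
For completeness, a sketch of how a client could drive load through the setup above: it points a plain gRPC channel at the l5d Service’s outgoing port (4140) and tolerates the expected NOT_FOUND failures. The hostname and the helloworld stubs are assumptions for illustration; in the original setup clients reach linkerd through the internal AWS load balancer created by the Service above.

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.StatusRuntimeException;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloRequest;

public class LoadThroughLinkerd {
  public static void main(String[] args) throws Exception {
    // Hypothetical hostname for the l5d Service; replace with the real LB address.
    String host = args.length > 0 ? args[0] : "l5d.example.internal";

    // On grpc-java 1.5.x this was usePlaintext(true); newer releases use usePlaintext().
    ManagedChannel channel = ManagedChannelBuilder.forAddress(host, 4140)
        .usePlaintext()
        .build();
    GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);

    // The h2 router's io.l5d.header.path identifier (segments: 1) routes on the
    // first segment of the gRPC :path pseudo-header, so no extra headers are set;
    // the dtab (elided above) maps that name to the backend service.
    for (int i = 0; ; i++) {
      try {
        stub.sayHello(HelloRequest.newBuilder().setName("req-" + i).build());
      } catch (StatusRuntimeException e) {
        // Roughly half of the calls are expected to fail with NOT_FOUND.
      }
    }
  }
}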

Thanks!

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Reactions: 2
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
ismaelka commented, Nov 20, 2017

@siggy Just to confirm, we’ve seen a similar issue when using Linkerd as a sidecar for TLS; below is the GC log of the Linkerd process running over 4 days.

[GC log chart. Purple: tenured generation; green: GC times line.]

The initial heap dump analysis suggests it might be similar to the issue reported in #1696: [heap dump screenshot]

I will post more details once we have further findings.
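
A small watcher along these lines can make it easier to correlate heap growth with load: it polls linkerd’s TwitterServer admin endpoint, /admin/metrics.json on the admin port configured above, and prints JVM heap usage. The metric key used here is an assumption and may differ between linkerd versions; dump the full JSON once to confirm.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LinkerdHeapWatch {
  public static void main(String[] args) throws Exception {
    // Admin address is an assumption; adjust to wherever port 9990 is reachable.
    String adminBase = args.length > 0 ? args[0] : "http://localhost:9990";
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(adminBase + "/admin/metrics.json"))
        .build();

    while (true) {
      String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
      // Crude string scan to avoid a JSON dependency; a real tool should parse properly.
      int idx = body.indexOf("\"jvm/mem/current/used\"");
      if (idx >= 0) {
        int colon = body.indexOf(':', idx);
        int end = body.indexOf(',', colon);
        if (end < 0) end = body.indexOf('}', colon);
        System.out.println(System.currentTimeMillis() + " used_bytes=" + body.substring(colon + 1, end).trim());
      } else {
        System.out.println("metric not found; inspect " + adminBase + "/admin/metrics.json");
      }
      Thread.sleep(10_000); // poll every 10 seconds
    }
  }
}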

0 reactions
adleong commented, Jan 2, 2018

Closing due to inactivity. @cb16 please re-open if you are still seeing this issue with the latest Linkerd version.
