Linkerd pods in strange state - possible memory leak
Issue Type: Bug report
What happened:
We use linkerd on the server side as a load balancer for our gRPC application, running on Kubernetes. The clients don't run linkerd. Once clients started using the application, memory usage of the linkerd pods began increasing one by one. None of the pods were killed; instead, each pod eventually reached a point where it stopped receiving requests, and once all pods reached that state the clients couldn't reach the API anymore. We tried several version changes (linkerd, gRPC, and Kubernetes), but the issue persisted. Note that our other server applications use the same configuration and don't seem to have this issue. Our application deliberately returns a gRPC status exception for almost half of its requests; for testing, we removed that exception and the issue stopped happening.
How to reproduce it (as minimally and precisely as possible):
With a simple gRPC application, send requests to it and return gRPC status exceptions in the response observer: responseObserver.onError(new StatusRuntimeException(Status.NOT_FOUND))
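A minimal sketch of such a server, assuming the stubs generated from the standard grpc-java "helloworld" example proto (GreeterGrpc, HelloRequest, HelloReply) and an arbitrary port; every call is failed on purpose to mimic the behaviour described above:

import io.grpc.Server;
import io.grpc.ServerBuilder;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.stub.StreamObserver;

public class FailingGreeterServer {
  // Deliberately fails every call with NOT_FOUND, mimicking what our
  // application does for roughly half of its requests.
  static class FailingGreeter extends GreeterGrpc.GreeterImplBase {
    @Override
    public void sayHello(HelloRequest request, StreamObserver<HelloReply> responseObserver) {
      responseObserver.onError(new StatusRuntimeException(Status.NOT_FOUND));
    }
  }

  public static void main(String[] args) throws Exception {
    Server server = ServerBuilder.forPort(50051)  // port is arbitrary for the repro
        .addService(new FailingGreeter())
        .build()
        .start();
    server.awaitTermination();
  }
}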
Environment:
- Linkerd 1.1.2 (also happened with 1.2.0 and 1.3.0)
- Kubernetes 1.7.0 (also happened with 1.7.8)
- gRPC 1.5.0 (also happened with 1.3.0)
apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
  namespace: (...)
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9990
    namers:
    - kind: io.l5d.k8s
      experimental: true
      host: localhost
      port: 8001
    telemetry:
    - kind: io.l5d.prometheus
    - kind: io.l5d.recentRequests
      sampleRate: 0.25
    usage:
      orgId: linkerd-daemonset-grpc
    routers:
    - protocol: h2
      label: outgoing
      experimental: true
      dtab: |
        (...)
      identifier:
        kind: io.l5d.header.path
        segments: 1
      servers:
      - port: 4140
        ip: 0.0.0.0
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d
  name: l5d
  namespace: (...)
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: l5d
      annotations:
        linkerd.io/scrape: 'true'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.1.2
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140
        - name: admin
          containerPort: 9990
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true
      - name: kubectl
        image: buoyantio/kubectl:v1.6.2
        args:
        - "proxy"
        - "-p"
        - "8001"
apiVersion: v1
kind: Service
metadata:
  annotations:
    "service.beta.kubernetes.io/aws-load-balancer-internal": 0.0.0.0/0
  name: l5d
  namespace: (...)
spec:
  selector:
    app: l5d
  type: LoadBalancer
  ports:
  - name: outgoing
    port: 4140
  - name: incoming
    port: 4141
  - name: admin
    port: 9990
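For context, a minimal client sketch of how requests reach the application through the l5d Service above. This is a sketch under assumptions: the in-cluster DNS name "l5d" for the Service, the same helloworld stubs as in the reproduction sketch, and a recent grpc-java. With the io.l5d.header.path identifier and segments: 1, linkerd should name each request by the first segment of the HTTP/2 :path (e.g. /helloworld.Greeter), which the elided dtab would map to a backing Kubernetes service.

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.StatusRuntimeException;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloRequest;

public class ClientThroughLinkerd {
  public static void main(String[] args) {
    // Plaintext HTTP/2 to linkerd's "outgoing" router on port 4140
    // (the "l5d" hostname is an assumption about in-cluster DNS).
    ManagedChannel channel = ManagedChannelBuilder.forAddress("l5d", 4140)
        .usePlaintext()
        .build();
    GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
    try {
      stub.sayHello(HelloRequest.newBuilder().setName("test").build());
    } catch (StatusRuntimeException e) {
      // Expected with the reproduction server above: the call fails with NOT_FOUND.
      System.out.println("status: " + e.getStatus());
    }
    channel.shutdown();
  }
}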
Thanks!
Top GitHub Comments
@siggy Just to confirm, we’ve seen a similar issue when using Linkerd as a sidecar for TLS; below is the GC log of the Linkerd process running over four days.
[Chart of the GC log: purple = tenured generation, green = GC times.]
The initial heap dump analysis suggests it might be similar to the issue reported in #1696.
I will post more details once we have further findings.
Closing due to inactivity. @cb16 please re-open if you are still seeing this issue with the latest Linkerd version.