Linkerd pods in strange state - possible memory leak
Issue Type: Bug report
What happened:
We use linkerd on the server side as a load balancer for our gRPC application, running on Kubernetes. The clients don't run linkerd. Once clients started using the application, memory usage of the linkerd pods began increasing one by one. None of the pods were killed; instead, each pod eventually reached a point where it stopped receiving requests, and once all pods reached that state the clients couldn't reach the API anymore. We tried several version changes (linkerd, gRPC, and Kubernetes), but the issue persisted. Note that our other server applications use the same configuration and don't seem to have this issue. Our application deliberately returns a gRPC status exception for almost half of its requests; for testing, we removed that exception and the issue stopped happening.
How to reproduce it (as minimally and precisely as possible):
With a simple gRPC application, send requests to it and return gRPC status exceptions in the response observer: responseObserver.onError(new StatusRuntimeException(Status.NOT_FOUND))
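A minimal sketch of such a server, assuming the stubs generated from the standard grpc-java "helloworld" example proto (GreeterGrpc, HelloRequest, HelloReply) and an arbitrary port; every call is failed on purpose to mimic the behaviour described above:

import io.grpc.Server;
import io.grpc.ServerBuilder;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloReply;
import io.grpc.examples.helloworld.HelloRequest;
import io.grpc.stub.StreamObserver;

public class FailingGreeterServer {
  // Deliberately fails every call with NOT_FOUND, mimicking what our
  // application does for roughly half of its requests.
  static class FailingGreeter extends GreeterGrpc.GreeterImplBase {
    @Override
    public void sayHello(HelloRequest request, StreamObserver<HelloReply> responseObserver) {
      responseObserver.onError(new StatusRuntimeException(Status.NOT_FOUND));
    }
  }

  public static void main(String[] args) throws Exception {
    Server server = ServerBuilder.forPort(50051)  // port is arbitrary for the repro
        .addService(new FailingGreeter())
        .build()
        .start();
    server.awaitTermination();
  }
}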
Environment:
- Linkerd 1.1.2 (also happened with 1.2.0 and 1.3.0)
- Kubernetes 1.7.0 (also happened with 1.7.8)
- gRPC 1.5.0 (also happened with 1.3.0)
apiVersion: v1
kind: ConfigMap
metadata:
  name: l5d-config
  namespace: (...)
data:
  config.yaml: |-
    admin:
      ip: 0.0.0.0
      port: 9990
    namers:
    - kind: io.l5d.k8s
      experimental: true
      host: localhost
      port: 8001
    telemetry:
    - kind: io.l5d.prometheus
    - kind: io.l5d.recentRequests
      sampleRate: 0.25
    usage:
      orgId: linkerd-daemonset-grpc
    routers:
    - protocol: h2
      label: outgoing
      experimental: true
      dtab: |
        (...)
      identifier:
        kind: io.l5d.header.path
        segments: 1
      servers:
      - port: 4140
        ip: 0.0.0.0
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: l5d
  name: l5d
  namespace: (...)
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: l5d
      annotations:
        linkerd.io/scrape: 'true'
        prometheus.io/scrape: 'true'
    spec:
      volumes:
      - name: l5d-config
        configMap:
          name: "l5d-config"
      containers:
      - name: l5d
        image: buoyantio/linkerd:1.1.2
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        args:
        - /io.buoyant/linkerd/config/config.yaml
        ports:
        - name: outgoing
          containerPort: 4140
          hostPort: 4140
        - name: admin
          containerPort: 9990
        volumeMounts:
        - name: "l5d-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true
      - name: kubectl
        image: buoyantio/kubectl:v1.6.2
        args:
        - "proxy"
        - "-p"
        - "8001"
apiVersion: v1
kind: Service
metadata:
  annotations:
    "service.beta.kubernetes.io/aws-load-balancer-internal": 0.0.0.0/0
  name: l5d
  namespace: (...)
spec:
  selector:
    app: l5d
  type: LoadBalancer
  ports:
  - name: outgoing
    port: 4140
  - name: incoming
    port: 4141
  - name: admin
    port: 9990
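For context, a minimal client sketch of how requests reach the application through the l5d Service above. This is a sketch under assumptions: the in-cluster DNS name "l5d" for the Service, the same helloworld stubs as in the reproduction sketch, and a recent grpc-java. With the io.l5d.header.path identifier and segments: 1, linkerd should name each request by the first segment of the HTTP/2 :path (e.g. /helloworld.Greeter), which the elided dtab would map to a backing Kubernetes service.

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.StatusRuntimeException;
import io.grpc.examples.helloworld.GreeterGrpc;
import io.grpc.examples.helloworld.HelloRequest;

public class ClientThroughLinkerd {
  public static void main(String[] args) {
    // Plaintext HTTP/2 to linkerd's "outgoing" router on port 4140
    // (the "l5d" hostname is an assumption about in-cluster DNS).
    ManagedChannel channel = ManagedChannelBuilder.forAddress("l5d", 4140)
        .usePlaintext()
        .build();
    GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);
    try {
      stub.sayHello(HelloRequest.newBuilder().setName("test").build());
    } catch (StatusRuntimeException e) {
      // Expected with the reproduction server above: the call fails with NOT_FOUND.
      System.out.println("status: " + e.getStatus());
    }
    channel.shutdown();
  }
}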
Thanks!
Top GitHub Comments
@siggy Just to confirm, we’ve seen a similar issue when using Linkerd as a sidecar for TLS; below is the GC log of the Linkerd process running over four days.
[Chart of the GC log: purple = tenured generation, green = GC times.]
The initial heap dump analysis suggests it might be similar to the issue reported in #1696.
I will post more details once we have further findings.
Closing due to inactivity. @cb16 please re-open if you are still seeing this issue with the latest Linkerd version.