Auth & device registry keep getting OOMKilled with default limits
I’m running Hono 0.7 on Kubernetes, deployed with the provided deployment script and next to no changes to the resources, i.e. I haven’t applied any of the tuning described in https://www.eclipse.org/hono/deployment/resource-limitation/.
Here is an extract from kubectl describe node:
Allocatable:
  cpu:                2
  ephemeral-storage:  63941352Ki
  hugepages-2Mi:      0
  memory:             7661808Ki
  pods:               110
System Info:
  Machine ID:                 996ebefb2cc44da1ae864d3c078ca1eb
  System UUID:                EC265B0A-9CA8-1B71-3E38-9FB5BC8E14FF
  Boot ID:                    9f7d86b5-42ca-44cf-a8aa-1f6c9b9494d9
  Kernel Version:             4.4.0-1066-aws
  OS Image:                   Ubuntu 16.04.2 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://17.3.2
  Kubelet Version:            v1.11.2
  Kube-Proxy Version:         v1.11.2
Non-terminated Pods:          (20 in total)
  Namespace  Name                                           CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------  ----                                           ------------  ----------  ---------------  -------------
  ...
  hono       hono-service-auth-7949d57744-4ch5t             0 (0%)        0 (0%)      196Mi (2%)       196Mi (2%)
  hono       hono-service-device-registry-85d87b66dd-m8nsl  0 (0%)        0 (0%)      256Mi (3%)       256Mi (3%)
  ...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests          Limits
  --------  --------          ------
  cpu       778m (38%)        498m (24%)
  memory    3582622656 (45%)  3768528Ki (49%)
However, I’m noticing that two services, the auth service and the device registry, are regularly OOMKilled (every few hours):
$ kubectl -n hono get pod
NAME                                            READY     STATUS    RESTARTS   AGE
grafana-5645865df8-4prlg                        1/1       Running   5          14d
hono-adapter-http-vertx-7d78bc5f4d-pwmgd        1/1       Running   6          14d
hono-adapter-mqtt-vertx-799bd5858c-stw6b        1/1       Running   3          5d
hono-artemis-797c695777-gljgv                   1/1       Running   5          14d
hono-dispatch-router-5fd7756dfb-bmnzf           1/1       Running   5          14d
hono-service-auth-7949d57744-4ch5t              1/1       Running   172        11d
hono-service-device-registry-85d87b66dd-m8nsl   1/1       Running   79         14d
influxdb-784f8f677c-xd6k5                       1/1       Running   5          14d
Here’s the relevant kubectl describe pod output for the auth service (the device registry is similar):
$ kubectl -n hono describe pod -l app=hono-service-auth
...
    Last State:     Terminated
      Reason:       OOMKilled
    ...
    Restart Count:  172
    Limits:
      memory:  196Mi
    Requests:
      memory:  196Mi
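To see how close the pods actually get to that 196Mi limit before they are killed, live usage can be watched. A minimal sketch, assuming a metrics pipeline such as Heapster or metrics-server is available in the cluster (which the Hono deployment does not set up by itself):

$ kubectl -n hono top pod
$ kubectl -n hono top pod -l app=hono-service-auth

Note that what the limit is enforced against is the container’s total resident memory, i.e. the Java heap plus metaspace, thread stacks and direct buffers, not just the -Xmx heap.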
To confirm that -Xmx150m is set:
$ kubectl -n hono describe deployment -l app=hono-service-auth
...
    Environment:
      SPRING_CONFIG_LOCATION:  file:///etc/hono/
      SPRING_PROFILES_ACTIVE:  authentication-impl,dev
      LOGGING_CONFIG:          classpath:logback-spring.xml
      _JAVA_OPTIONS:           -Xmx150m
      KUBERNETES_NAMESPACE:    (v1:metadata.namespace)
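To double-check that the JVM inside the container really picks the option up, the resolved maximum heap size can be printed from the running pod. A sketch, assuming a java binary is on the PATH inside the image (the pod name is the one listed above):

$ kubectl -n hono exec hono-service-auth-7949d57744-4ch5t -- \
    java -XX:+PrintFlagsFinal -version | grep -i maxheapsize

The JVM also prints “Picked up _JAVA_OPTIONS: -Xmx150m” on startup, so kubectl logs is another quick way to verify it.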
I can increase the limits, but I wanted to know whether it’s just me or whether the defaults are wrong. What could be the cause of these memory issues? Thanks!
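For reference, bumping the limit on the running deployment could look roughly like this. A sketch with purely illustrative values, assuming the deployment is named hono-service-auth and has a single container, as the pod names above suggest:

$ kubectl -n hono patch deployment hono-service-auth --type=json \
    -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"256Mi"}]'

(The comments below end up suggesting the opposite direction: keep the Kubernetes limit and lower -Xmx instead.)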
Issue Analytics
- Created 5 years ago
- Comments: 11 (6 by maintainers)
Top GitHub Comments
I completely see your point, sounds reasonable.
As a quick piece of information: I configured -Xmx80m for the device registry locally (not running in a container) and monitored the heap usage and the log while firing telemetry messages at the HTTP adapter. No problems: a clean sawtooth behaviour, no OOM exceptions, no malfunctions.

So go ahead, tweak Xmx down, and let’s see if it works in your environment as well. Then we can lower the Xmx settings in the default descriptors of Hono, too (but carefully, since setting them too low may cause problems, and it is hard to find out what “too low” is). To me it looks like your original problem came from the small amount of memory that was left for the pod after the JVM took its full memory assignment. And this would be exactly addressed by what you proposed: leave the Kubernetes limits as they were and lower the Xmx. If you try it out, please post your results here, thanks!
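A quick way to try that without editing the descriptors is to override the variable on the deployments directly, which triggers a rolling restart. A sketch, assuming the deployment names match the pod names shown above:

$ kubectl -n hono set env deployment/hono-service-auth _JAVA_OPTIONS=-Xmx80m
$ kubectl -n hono set env deployment/hono-service-device-registry _JAVA_OPTIONS=-Xmx80m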
@ghys is this still an issue? If not, can you close this issue?