Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Auth & device registry keep getting OOMKilled with default limits

See original GitHub issue

I’m running Hono 0.7 on Kubernetes using the deployment script with next to no changes to the resource settings, i.e. I haven’t applied any of the changes described in https://www.eclipse.org/hono/deployment/resource-limitation/.
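
For reference, the configured requests and limits can be read back from the deployment itself; a minimal sketch, using the app=hono-service-auth label that appears later in this issue:

$ kubectl -n hono get deployment -l app=hono-service-auth \
    -o jsonpath='{.items[0].spec.template.spec.containers[0].resources}'
# prints the container's resources stanza, i.e. the 196Mi request/limit pair
# visible in the describe output below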

Extract from kubectl describe node:

Allocatable:
 cpu:                2
 ephemeral-storage:  63941352Ki
 hugepages-2Mi:      0
 memory:             7661808Ki
 pods:               110
System Info:
 Machine ID:                 996ebefb2cc44da1ae864d3c078ca1eb
 System UUID:                EC265B0A-9CA8-1B71-3E38-9FB5BC8E14FF
 Boot ID:                    9f7d86b5-42ca-44cf-a8aa-1f6c9b9494d9
 Kernel Version:             4.4.0-1066-aws
 OS Image:                   Ubuntu 16.04.2 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.2
 Kubelet Version:            v1.11.2
 Kube-Proxy Version:         v1.11.2
Non-terminated Pods:         (20 in total)
  Namespace                  Name                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                               ------------  ----------  ---------------  -------------
...
  hono                       hono-service-auth-7949d57744-4ch5t                 0 (0%)        0 (0%)      196Mi (2%)       196Mi (2%)
  hono                       hono-service-device-registry-85d87b66dd-m8nsl      0 (0%)        0 (0%)      256Mi (3%)       256Mi (3%)
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests          Limits
  --------  --------          ------
  cpu       778m (38%)        498m (24%)
  memory    3582622656 (45%)  3768528Ki (49%)

However, I’m noticing that two services, the auth service and the device registry, are regularly OOMKilled (every few hours):

$ kubectl -n hono get pod
NAME                                            READY     STATUS    RESTARTS   AGE
grafana-5645865df8-4prlg                        1/1       Running   5          14d
hono-adapter-http-vertx-7d78bc5f4d-pwmgd        1/1       Running   6          14d
hono-adapter-mqtt-vertx-799bd5858c-stw6b        1/1       Running   3          5d
hono-artemis-797c695777-gljgv                   1/1       Running   5          14d
hono-dispatch-router-5fd7756dfb-bmnzf           1/1       Running   5          14d
hono-service-auth-7949d57744-4ch5t              1/1       Running   172        11d
hono-service-device-registry-85d87b66dd-m8nsl   1/1       Running   79         14d
influxdb-784f8f677c-xd6k5                       1/1       Running   5          14d

Here’s the relevant describe pod output for the auth service (the device registry is similar):

$ kubectl -n hono describe pod -l app=hono-service-auth
...
    Last State:     Terminated
      Reason:       OOMKilled
...
    Restart Count:  172
    Limits:
      memory:  196Mi
    Requests:
      memory:   196Mi

To confirm that -Xmx150m is set:

$ kubectl -n hono describe deployment -l app=hono-service-auth
...
    Environment:
      SPRING_CONFIG_LOCATION:  file:///etc/hono/
      SPRING_PROFILES_ACTIVE:  authentication-impl,dev
      LOGGING_CONFIG:          classpath:logback-spring.xml
      _JAVA_OPTIONS:           -Xmx150m
      KUBERNETES_NAMESPACE:     (v1:metadata.namespace)
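
As an additional runtime check (a sketch, assuming the image ships a java binary on its PATH), the effective max heap can also be read from inside the running container:

$ kubectl -n hono exec hono-service-auth-7949d57744-4ch5t -- \
    java -XX:+PrintFlagsFinal -version | grep -i maxheapsize
# should report MaxHeapSize = 157286400 (i.e. 150m) once _JAVA_OPTIONS is picked up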

I can increase the limits, but I wanted to know: is it just me, or are the defaults wrong? What could be the cause of these memory issues? Thanks!
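
A minimal sketch of the increase-the-limits workaround (the deployment name is inferred from the pod names in the listing above, and 256Mi is just an illustrative value):

$ kubectl -n hono set resources deployment hono-service-auth \
    --requests=memory=256Mi --limits=memory=256Mi
# bumps both request and limit; the pods are recreated with the new values

As the discussion below suggests, though, lowering -Xmx so the JVM leaves headroom under the existing limit may be the better fix.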

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
sysexcontrol commented, Sep 11, 2018

I completely see your point, sounds reasonable.

As a quick data point: I configured -Xmx80m for the device registry locally (not running in a container) and monitored the heap usage and the log while firing telemetry messages at the HTTP adapter. No problems: a clean sawtooth heap pattern, no OOM exceptions, no malfunctions.

So go ahead and tweak Xmx down, and let’s see if it works in your environment as well. Then we can lower the Xmx settings in Hono’s default descriptors too (carefully, though: setting them too low may cause other problems, and it is hard to find out what “too low” is). To me it looks like your original problem came from the small amount of memory that was left for the pod after the JVM took its full heap assignment; the heap is only part of the JVM’s footprint, and Metaspace, thread stacks and direct buffers come on top, so -Xmx150m inside a 196Mi limit leaves little headroom. And that is exactly what your proposal addresses: leave the Kubernetes limits as they are and lower the Xmx.

Once you’ve tried it out, please post your results here. Thanks!
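
A minimal sketch of this experiment (assuming the deployment names match the pod names above; -Xmx80m is the value tested locally in this comment):

$ kubectl -n hono set env deployment/hono-service-auth _JAVA_OPTIONS='-Xmx80m'
$ kubectl -n hono set env deployment/hono-service-device-registry _JAVA_OPTIONS='-Xmx80m'
# leaves the 196Mi / 256Mi limits untouched, but frees ~116Mi / ~176Mi of headroom
# for Metaspace, thread stacks, GC structures and other native allocations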

0 reactions
sophokles73 commented, Mar 25, 2019

@ghys is this still an issue? If not, can you close this issue?

Read more comments on GitHub >

Top Results From Across the Web

OOMKilled: Troubleshooting Kubernetes Memory Requests ...
The OOMKilled: Limit Overcommit error can occur when the sum of pod limits is greater than the available memory on the node. So...
Read more >
source-controller pod restarting (OOMKilled) #192 - GitHub
I've re-deployed a newer version (0.2.1) but the restarts keep happening (about 2 every half hour). $> k describe po -n gotk-system source- ......
Read more >
Out-of-memory (OOM) in Kubernetes – Part 4: Pod evictions ...
The article states it explicitly: “The kubelet evaluates eviction thresholds based on its configured housekeeping-interval which defaults to 10s ...
Read more >
How to handle OOMkilled errors in Kubernetes - IT Briefcase
The simplest way to remedy an OOMkilled error is to increase the memory limit and then recreate the container. This can be done...
Read more >
How to Fix OOMKilled Kubernetes Error (Exit Code 137)
OOMKilled (exit code 137) occur when K8s pods are killed because they use more memory than their limits. Learn how to resolve the...
Read more >
