
Experiencing SegmentStore CrashLoopBackOff due to OutOfMemoryError


With moderate IO, small-transaction IO, and pravega-benchmark running 5 streams (10 writers, 10 readers, 1000-byte events, 100 events per second, and 10 segments), the SegmentStores go into CrashLoopBackOff with the following exception after ~1 day of IO operations.

java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid1.hprof ...
Heap dump file created [2640783672 bytes in 8.774 secs]
Aborting due to java.lang.OutOfMemoryError: GC overhead limit exceeded
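
For context, the heap-dump and abort messages above are the kind of output HotSpot emits when it is started with heap-dump-on-OOM options. A minimal sketch of such flags follows; whether the SegmentStore image sets exactly these options, or passes them through a JAVA_OPTS-style variable at all, is an assumption rather than something stated in this issue.

# Hypothetical JVM options consistent with the log lines above:
#   -XX:+HeapDumpOnOutOfMemoryError  -> "Dumping heap to java_pid<N>.hprof ..."
#   -XX:+CrashOnOutOfMemoryError     -> "Aborting due to java.lang.OutOfMemoryError ..."
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -XX:+CrashOnOutOfMemoryError"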

Environment details: PKS / Kubernetes with a medium cluster:

3 master nodes @ large.cpu (4 CPU, 4 GB RAM, 16 GB disk)
5 worker nodes @ xlarge.cpu (8 CPU, 8 GB RAM, 32 GB disk)
Tier-1 storage is from a vSAN datastore
Tier-2 storage is carved out on the NFS client provisioner using Isilon as the backend

Pravega version: zk-closed-client-issue-0.5.0-2162.0bbfa42
Zookeeper Operator: 0.2.1
Pravega Operator: 0.3.2
NAMESPACE           NAME                                             READY     STATUS             RESTARTS   AGE
default             isilon-nfs-client-provisioner-67b7ffff86-vn6z6   1/1       Running            0          1d
default             pravega-benchmark                                1/1       Running            0          2d
default             pravega-bookie-0                                 1/1       Running            1          1d
default             pravega-bookie-1                                 1/1       Running            1          2d
default             pravega-bookie-2                                 1/1       Running            1          1d
default             pravega-bookie-3                                 1/1       Running            1          2d
default             pravega-bookie-4                                 1/1       Running            1          1d
default             pravega-operator-779879b48-hbcnw                 1/1       Running            0          2d
default             pravega-pravega-controller-c67d6b758-hdpp9       1/1       Running            1          1d
default             pravega-pravega-controller-c67d6b758-l9stc       1/1       Running            2          2d
default             pravega-pravega-segmentstore-0                   1/1       Running            67         2d
default             pravega-pravega-segmentstore-1                   0/1       CrashLoopBackOff   125        2d
default             pravega-pravega-segmentstore-2                   1/1       Running            130        2d
default             pravega-zk-0                                     1/1       Running            0          2d
default             pravega-zk-1                                     1/1       Running            0          3h
default             pravega-zk-2                                     1/1       Running            0          2d
default             zookeeper-operator-685bfcbbc5-rk5cs              1/1       Running            0          2d
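
To see why the SegmentStore pods keep restarting, the usual Kubernetes checks are to describe the failing pod and pull logs from the previous (crashed) container instance. A short sketch, using the pod names and default namespace from the listing above:

# Restart reason (OOMKilled vs. application exit) and recent events for the failing pod
kubectl describe pod pravega-pravega-segmentstore-1 -n default

# Logs from the crashed container instance, where the OutOfMemoryError above shows up
kubectl logs pravega-pravega-segmentstore-1 -n default --previous

# Restart counts across the deployment
kubectl get pods -n default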

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:7 (4 by maintainers)

Top GitHub Comments

1 reaction
deenav commented on Apr 10, 2019

@RaulGracia I have restarted the same experiment with the pravegaservice.readCacheSizeMB: "2048" value, and the longevity test has been running fine for ~10 hrs now.
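
For reference, here is a sketch of where that setting could be applied. With the Pravega operator, Pravega configuration properties are generally supplied through the PravegaCluster manifest (the pravega.yml mentioned below); the exact field layout shown here is an assumption and depends on the operator and chart version in use.

# Hypothetical excerpt of a PravegaCluster manifest (pravega.yml); field names are assumptions.
spec:
  pravega:
    options:
      # Cap the SegmentStore read cache at 2 GB, as described in the comment above.
      pravegaservice.readCacheSizeMB: "2048"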

0 reactions
RaulGracia commented on Apr 15, 2019

@sumit-bm @deenav The longevity run with the new configuration has been working fine for 4+ days, so I think we can close this issue. Please update your pravega.yml according to the guidelines and configuration values defined in the provisioning plan. Thanks for the feedback.


Top Results From Across the Web

Kubernetes CrashLoopBackOff Error: What It Is and How to Fix It
This error indicates that a pod failed to start, Kubernetes tried to restart it, and it continued to fail repeatedly. To make sure...

Kubernetes CrashLoopBackOff: What it is, and how to fix it?
CrashLoopBackOff is a Kubernetes state representing a restart ... The memory limits are too low, so the container is Out Of Memory killed...

Troubleshoot and Fix Kubernetes CrashLoopBackoff Status
The CrashLoopBackoff status is a notification that the pod is being restarted due to an error and is waiting for the specified 'backoff'...

Understanding Kubernetes CrashLoopBackoff Events
CrashLoopBackOff is a status message that indicates one of your pods is in a constant state of flux: one or more containers are failing...

Troubleshoot: Pod Crashloopbackoff - Devtron
A pod stuck in a CrashLoopBackOff is an error while deploying applications to Kubernetes. While in CrashLoopBackOff, the pod keeps crashing.
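
Since several of the results above point at container memory limits as a common CrashLoopBackOff cause, it helps to separate the two failure modes: a container killed by the kubelet shows OOMKilled in kubectl describe, while the SegmentStore here is aborting on a JVM-level OutOfMemoryError inside its own heap. The following container-level sketch uses placeholder numbers that are assumptions, not the values of this deployment.

# Hypothetical container resources block; the numbers are placeholders only.
# The JVM heap (-Xmx) plus any direct/off-heap memory must fit under
# limits.memory, otherwise the kubelet OOM-kills the container before the
# JVM ever reports its own OutOfMemoryError.
resources:
  requests:
    memory: "4Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"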
