Experiencing SegmentStore CrashLoopBackOff due to OutOfMemoryError
With moderate IO (small-transaction IO from pravega-benchmark against 5 streams: 10 writers, 10 readers, 1000-byte events, 100 events per second, and 10 segments), the SegmentStores are going into CrashLoopBackOff with the following exception after ~1 day of IO operations.
java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid1.hprof ...
Heap dump file created [2640783672 bytes in 8.774 secs]
Aborting due to java.lang.OutOfMemoryError: GC overhead limit exceeded
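The "GC overhead limit exceeded" error means the JVM is spending almost all of its time in garbage collection while reclaiming very little heap, i.e. the live data set no longer fits in the configured heap. To see what the segment store logged before its last restart and to pull the heap dump out for offline analysis, something like the following can be used (the in-container path of the .hprof file is an assumption based on the relative file name in the log, so adjust it to the process working directory; kubectl cp only works while the container is up, so it may need to run during one of the running intervals):

# Logs from the previous, crashed container instance of the segment store pod
kubectl logs --previous pravega-pravega-segmentstore-1 -n default

# Copy the heap dump out of the pod for analysis (e.g. in Eclipse MAT);
# the source path is hypothetical and depends on the container's working directory
kubectl cp default/pravega-pravega-segmentstore-1:java_pid1.hprof ./java_pid1.hprof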
Environment details: PKS / K8s with a medium cluster:
3 master nodes @ large.cpu (4 CPUs, 4 GB RAM, 16 GB disk)
5 worker nodes @ xlarge.cpu (8 CPUs, 8 GB RAM, 32 GB disk)
Tier-1 storage is from a VSAN datastore
Tier-2 storage is carved out on the NFS client provisioner, using Isilon as the backend
Pravega version: zk-closed-client-issue-0.5.0-2162.0bbfa42
Zookeeper Operator: 0.2.1
Pravega Operator: 0.3.2
NAMESPACE NAME READY STATUS RESTARTS AGE
default isilon-nfs-client-provisioner-67b7ffff86-vn6z6 1/1 Running 0 1d
default pravega-benchmark 1/1 Running 0 2d
default pravega-bookie-0 1/1 Running 1 1d
default pravega-bookie-1 1/1 Running 1 2d
default pravega-bookie-2 1/1 Running 1 1d
default pravega-bookie-3 1/1 Running 1 2d
default pravega-bookie-4 1/1 Running 1 1d
default pravega-operator-779879b48-hbcnw 1/1 Running 0 2d
default pravega-pravega-controller-c67d6b758-hdpp9 1/1 Running 1 1d
default pravega-pravega-controller-c67d6b758-l9stc 1/1 Running 2 2d
default pravega-pravega-segmentstore-0 1/1 Running 67 2d
default pravega-pravega-segmentstore-1 0/1 CrashLoopBackOff 125 2d
default pravega-pravega-segmentstore-2 1/1 Running 130 2d
default pravega-zk-0 1/1 Running 0 2d
default pravega-zk-1 1/1 Running 0 3h
default pravega-zk-2 1/1 Running 0 2d
default zookeeper-operator-685bfcbbc5-rk5cs 1/1 Running 0 2d
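The segment store pods above have accumulated 67, 125, and 130 restarts. A quick way to distinguish a kubelet-level OOMKill (pod memory limit exceeded) from the JVM-level abort shown in the exception above is to inspect the last container state; a minimal sketch, using the pod names and namespace from the listing:

# Shows Last State / Reason (OOMKilled vs Error) and the exit code of the crashing container
kubectl describe pod pravega-pravega-segmentstore-1 -n default

# Current memory usage per segment store pod (requires metrics-server)
kubectl top pod -n default | grep segmentstore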
Issue Analytics
- Created: 4 years ago
- Comments: 7 (4 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@RaulGracia I have restarted the same experiment with pravegaservice.readCacheSizeMB: "2048" and the longevity test has been running fine for ~10 hrs now.

@sumit-bm @deenav The longevity run with the new configuration has been working fine for 4+ days, so I think we can close this issue. Please update your pravega.yml according to the guidelines and configuration values defined in the provisioning plan. Thanks for the feedback.
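For reference, a minimal sketch of how that setting could be carried in the PravegaCluster custom resource (pravega.yml); the field layout assumes the pravega-operator v1alpha1 CRD of that era and everything other than the readCacheSizeMB option is a placeholder, so follow the provisioning plan for the actual values:

apiVersion: "pravega.pravega.io/v1alpha1"
kind: PravegaCluster
metadata:
  name: pravega
spec:
  pravega:
    # Pravega configuration properties passed straight through to the services
    options:
      # Cap the segment store read cache so the process fits the memory available on the 8 GB worker nodes
      pravegaservice.readCacheSizeMB: "2048"

The intent of the 2048 MB value is to keep the read cache comfortably below the memory the segment store container can actually use, leaving headroom for the JVM heap and the other pods scheduled on the same node.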