
Experiencing SegmentStore CrashLoopBackOff due to OutOfMemoryError


With moderate IO, small-transaction IO, and pravega-benchmark running 5 streams (10 writers, 10 readers, 1000-byte events, 100 events per second, and 10 segments), the SegmentStores go into CrashLoopBackOff with the following exception after ~1 day of IO operations.

java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to java_pid1.hprof ...
Heap dump file created [2640783672 bytes in 8.774 secs]
Aborting due to java.lang.OutOfMemoryError: GC overhead limit exceeded
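
For context, the heap-dump and abort messages above are the kind of output HotSpot emits when it is started with heap-dump-on-OOM options. A minimal sketch of such flags follows; whether the SegmentStore image sets exactly these options, or passes them through a JAVA_OPTS-style variable at all, is an assumption rather than something stated in this issue.

# Hypothetical JVM options consistent with the log lines above:
#   -XX:+HeapDumpOnOutOfMemoryError  -> "Dumping heap to java_pid<N>.hprof ..."
#   -XX:+CrashOnOutOfMemoryError     -> "Aborting due to java.lang.OutOfMemoryError ..."
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -XX:+CrashOnOutOfMemoryError"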

Environment details: PKS / Kubernetes with a medium cluster:

3 master nodes @ large.cpu (4 CPU, 4 GB RAM, 16 GB disk)
5 worker nodes @ xlarge.cpu (8 CPU, 8 GB RAM, 32 GB disk)
Tier-1 storage is from a vSAN datastore
Tier-2 storage is carved out on the NFS client provisioner using Isilon as the backend

Pravega version: zk-closed-client-issue-0.5.0-2162.0bbfa42
Zookeeper Operator: 0.2.1
Pravega Operator: 0.3.2
NAMESPACE           NAME                                             READY     STATUS             RESTARTS   AGE
default             isilon-nfs-client-provisioner-67b7ffff86-vn6z6   1/1       Running            0          1d
default             pravega-benchmark                                1/1       Running            0          2d
default             pravega-bookie-0                                 1/1       Running            1          1d
default             pravega-bookie-1                                 1/1       Running            1          2d
default             pravega-bookie-2                                 1/1       Running            1          1d
default             pravega-bookie-3                                 1/1       Running            1          2d
default             pravega-bookie-4                                 1/1       Running            1          1d
default             pravega-operator-779879b48-hbcnw                 1/1       Running            0          2d
default             pravega-pravega-controller-c67d6b758-hdpp9       1/1       Running            1          1d
default             pravega-pravega-controller-c67d6b758-l9stc       1/1       Running            2          2d
default             pravega-pravega-segmentstore-0                   1/1       Running            67         2d
default             pravega-pravega-segmentstore-1                   0/1       CrashLoopBackOff   125        2d
default             pravega-pravega-segmentstore-2                   1/1       Running            130        2d
default             pravega-zk-0                                     1/1       Running            0          2d
default             pravega-zk-1                                     1/1       Running            0          3h
default             pravega-zk-2                                     1/1       Running            0          2d
default             zookeeper-operator-685bfcbbc5-rk5cs              1/1       Running            0          2d
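
To see why the SegmentStore pods keep restarting, the usual Kubernetes checks are to describe the failing pod and pull logs from the previous (crashed) container instance. A short sketch, using the pod names and default namespace from the listing above:

# Restart reason (OOMKilled vs. application exit) and recent events for the failing pod
kubectl describe pod pravega-pravega-segmentstore-1 -n default

# Logs from the crashed container instance, where the OutOfMemoryError above shows up
kubectl logs pravega-pravega-segmentstore-1 -n default --previous

# Restart counts across the deployment
kubectl get pods -n default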

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments:7 (4 by maintainers)

Top GitHub Comments

1 reaction
deenav commented on Apr 10, 2019

@RaulGracia I have restarted the same experiment with the pravegaservice.readCacheSizeMB: "2048" value, and the longevity test has been running fine for ~10 hrs now.
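
For reference, here is a sketch of where that setting could be applied. With the Pravega operator, Pravega configuration properties are generally supplied through the PravegaCluster manifest (the pravega.yml mentioned below); the exact field layout shown here is an assumption and depends on the operator and chart version in use.

# Hypothetical excerpt of a PravegaCluster manifest (pravega.yml); field names are assumptions.
spec:
  pravega:
    options:
      # Cap the SegmentStore read cache at 2 GB, as described in the comment above.
      pravegaservice.readCacheSizeMB: "2048"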

0 reactions
RaulGracia commented on Apr 15, 2019

@sumit-bm @deenav The longevity run with the new configuration has been working fine for 4+ days, so I think we can close this issue. Please update your pravega.yml according to the guidelines and configuration values defined in the provisioning plan. Thanks for the feedback.


Top Results From Across the Web

Kubernetes CrashLoopBackOff Error: What It Is and How to Fix It
This error indicates that a pod failed to start, Kubernetes tried to restart it, and it continued to fail repeatedly. To make sure...

Kubernetes CrashLoopBackOff: What it is, and how to fix it?
CrashLoopBackOff is a Kubernetes state representing a restart ... The memory limits are too low, so the container is Out Of Memory killed...

Troubleshoot and Fix Kubernetes CrashLoopBackoff Status
The CrashLoopBackoff status is a notification that the pod is being restarted due to an error and is waiting for the specified 'backoff'...

Understanding Kubernetes CrashLoopBackoff Events
CrashLoopBackOff is a status message that indicates one of your pods is in a constant state of flux: one or more containers are failing...

Troubleshoot: Pod Crashloopbackoff - Devtron
A pod stuck in a CrashLoopBackOff is an error while deploying applications to Kubernetes. While in CrashLoopBackOff, the pod keeps crashing.
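
Since several of the results above point at container memory limits as a common CrashLoopBackOff cause, it helps to separate the two failure modes: a container killed by the kubelet shows OOMKilled in kubectl describe, while the SegmentStore here is aborting on a JVM-level OutOfMemoryError inside its own heap. The following container-level sketch uses placeholder numbers that are assumptions, not the values of this deployment.

# Hypothetical container resources block; the numbers are placeholders only.
# The JVM heap (-Xmx) plus any direct/off-heap memory must fit under
# limits.memory, otherwise the kubelet OOM-kills the container before the
# JVM ever reports its own OutOfMemoryError.
resources:
  requests:
    memory: "4Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"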
