Observed reader and SegmentContainer failure
Observed a reader failure, and a SegmentContainer failed with “ERROR i.p.s.s.s.StreamSegmentContainerRegistry - Critical failure for SegmentContainer Container Id = 29, State = FAILED. {} io.pravega.segmentstore.contracts.StreamingException: OperationProcessor stopped unexpectedly (no error) but DurableLog was not currently stopping”, while running IO using Longevity with a moderate workload (total: 4 readers, 3 writers, ~50 events/sec, ~40 KB/s IO).
One of the 4 readers also failed during this run:
INFO [2019-06-20 06:29:21,650] io.pravega.longevity.utils.PerformanceUtils: Readers (3/4): events:475,309,330, events/sec:946, KB/sec:725.89355
Note: In this cluster, Longevity IO had been running fine for ~5d 11h before this failure.
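Longevity is an internal test harness, so the exact workload code isn't shown here; as a rough point of reference, the writer side of a comparable workload using the public Pravega Java client might look like the sketch below. The scope/stream names are taken from the segment name that appears later in the log; the controller URI, scaling policy, event size and pacing are assumptions chosen to approximate ~50 events/sec at ~40 KB/s, and the client API may differ slightly between Pravega versions.

```java
import java.net.URI;
import java.util.concurrent.CompletableFuture;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class LongevityLikeWriter {
    public static void main(String[] args) throws Exception {
        // Assumed in-cluster controller endpoint; adjust to the actual service name.
        URI controller = URI.create("tcp://pravega-pravega-controller:9090");
        String scope = "longevity", stream = "small";   // matches 'longevity/small/...' in the log below

        try (StreamManager streamManager = StreamManager.create(controller)) {
            streamManager.createScope(scope);
            streamManager.createStream(scope, stream,
                    StreamConfiguration.builder().scalingPolicy(ScalingPolicy.fixed(3)).build());
        }

        ClientConfig clientConfig = ClientConfig.builder().controllerURI(controller).build();
        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, clientConfig);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     stream, new UTF8StringSerializer(), EventWriterConfig.builder().build())) {

            String payload = new String(new char[800]).replace('\0', 'x');  // ~800 B per event
            while (true) {                                                  // run until killed, as in a longevity test
                CompletableFuture<Void> ack = writer.writeEvent("key-" + System.nanoTime(), payload);
                ack.join();          // surface append failures (e.g. a failed container) immediately
                Thread.sleep(20);    // crude pacing for the sketch, roughly 50 events/sec
            }
        }
    }
}
```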
Environment details: PKS / K8s, medium cluster:
3 masters (xlarge): 4 CPU, 16 GB RAM, 32 GB disk
5 workers (2xlarge): 8 CPU, 32 GB RAM, 64 GB disk
Tier-1 storage is from a vSAN datastore
Tier-2 storage is carved out via the NFS client provisioner, with Isilon as the backend
Pravega version: 0.5.0-2269.6f8a820
Zookeeper Operator: tristan1900/zookeeper:0.2
Pravega Operator: pravega/pravega-operator:0.3.2
Snippet of the error:
2019-06-19 22:43:10,543 487252969 [core-23] WARN i.p.s.s.i.bookkeeper.BookKeeperLog - Log[29]: Too many rollover failures; closing.
java.util.concurrent.CompletionException: io.pravega.common.util.RetriesExhaustedException: java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:874)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:690)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.pravega.common.util.RetriesExhaustedException: java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
at io.pravega.common.util.Retry$RetryAndThrowBase.lambda$null$3(Retry.java:214)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
... 12 common frames omitted
Caused by: java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:708)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
... 8 common frames omitted
Caused by: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.persistMetadata(BookKeeperLog.java:802)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.updateMetadata(BookKeeperLog.java:756)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.rollover(BookKeeperLog.java:856)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
... 9 common frames omitted
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /pravega/pravega/segmentstore/containers/9/2/29
at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2272)
at org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:291)
at org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:287)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
at org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:284)
at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:270)
at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:33)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.persistMetadata(BookKeeperLog.java:794)
... 12 common frames omitted
2019-06-19 22:43:10,543 487252969 [core-23] ERROR i.p.s.s.h.handler.AppendProcessor - Error (Segment = 'longevity/small/1.#epoch.0', Operation = 'append')
java.util.concurrent.CancellationException: BookKeeperLog has been closed.
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.lambda$close$1(BookKeeperLog.java:170)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.close(BookKeeperLog.java:170)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.handleRolloverFailure(BookKeeperLog.java:146)
at io.pravega.common.function.Callbacks.invokeSafely(Callbacks.java:54)
at io.pravega.segmentstore.storage.impl.bookkeeper.SequentialAsyncProcessor.lambda$runInternal$0(SequentialAsyncProcessor.java:85)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at io.pravega.common.concurrent.Futures$Loop.handleException(Futures.java:729)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:690)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-06-19 22:43:11,741 487254167 [core-13] ERROR i.p.s.s.s.StreamSegmentContainerRegistry - Critical failure for SegmentContainer Container Id = 29, State = FAILED. {}
io.pravega.segmentstore.contracts.StreamingException: OperationProcessor stopped unexpectedly (no error) but DurableLog was not currently stopping.
at io.pravega.segmentstore.server.logs.DurableLog.queueStoppedHandler(DurableLog.java:405)
at io.pravega.common.concurrent.Services$ShutdownListener.terminated(Services.java:120)
at com.google.common.util.concurrent.AbstractService$3.call(AbstractService.java:95)
at com.google.common.util.concurrent.AbstractService$3.call(AbstractService.java:92)
at com.google.common.util.concurrent.ListenerCallQueue$PerListenerQueue.run(ListenerCallQueue.java:205)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-06-19 22:44:26,710 487329136 [core-16] INFO i.p.s.s.h.ZKSegmentContainerMonitor - Container Changes: Desired = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], Current = [21, 22, 23, 24, 25, 26, 27, 28, 30, 31], PendingTasks = [29], ToStart = [], ToStop = [].
2019-06-19 22:44:35,551 487337977 [epollEventLoopGroup-11-7] ERROR i.p.s.s.h.h.ServerConnectionInboundHandler - Caught exception on connection:
io.pravega.segmentstore.server.IllegalContainerStateException: Container 29 is in an invalid state for this operation. Expected: RUNNING; Actual: STARTING.
PS: the complete log (~30 MB) will be shared separately on Slack.
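For context on the failure chain in the snippet: the trace shows BookKeeperLog.persistMetadata performing a versioned setData on the log's metadata znode through Curator. The BadVersionException means the znode was modified by someone else since it was last read, which BookKeeperLog surfaces as DataLogWriterNotPrimaryException, i.e. this instance no longer holds exclusive ownership of log 29. The following is only an illustrative sketch of that compare-and-set pattern (the connection string and metadata handling are simplified placeholders, not Pravega's actual code):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.data.Stat;

public class VersionedZkUpdate {
    public static void main(String[] args) throws Exception {
        // Illustrative ZooKeeper connection string.
        try (CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zookeeper-client:2181", new ExponentialBackoffRetry(1000, 3))) {
            zk.start();

            // Path taken from the stack trace above.
            String path = "/pravega/pravega/segmentstore/containers/9/2/29";
            Stat stat = new Stat();
            byte[] current = zk.getData().storingStatIn(stat).forPath(path);  // read data + znode version

            byte[] updated = mutate(current);  // placeholder for the metadata change

            try {
                // Conditional update: only succeeds if the znode version is unchanged since the read.
                zk.setData().withVersion(stat.getVersion()).forPath(path, updated);
            } catch (KeeperException.BadVersionException e) {
                // Someone else updated the znode first; in Pravega's case this means another
                // segment store instance took over the log, hence DataLogWriterNotPrimaryException.
                throw new IllegalStateException("Lost exclusive ownership of " + path, e);
            }
        }
    }

    private static byte[] mutate(byte[] data) {
        return data;  // no-op for the sketch
    }
}
```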
Top GitHub Comments
@RaulGracia I have posted the Isilon endpoint details in the internal channel.
The behavior described here looks correct. The reader gets an exception after waiting for the recovery of a segment container to complete, and that particular recovery took a while. The exception does not indicate an unrecoverable error, but rather the inability to get a response within a bounded amount of time. This will happen occasionally in production use, and the application needs to be able to deal with it. How to deal with it is application-specific.
It is possible that we need to improve the recovery so that we shorten recovery time, but it does not strike me as a P0. It is also possible that recent commits fix this issue. I'm dropping the priority and moving it to 0.6.
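As the comment notes, how to deal with such bounded-wait failures is application-specific. One possible approach on the reader side, purely as an illustrative sketch against the Pravega Java EventStreamReader API (the retry budget, backoff policy and treatment of runtime exceptions as transient are assumptions, not a recommendation from the maintainers), is shown below:

```java
import java.time.Duration;
import java.time.Instant;

import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReinitializationRequiredException;

public class ResilientReadLoop {

    /**
     * Reads events, tolerating transient read failures (for example while a segment
     * container is recovering) for up to {@code maxOutage} before giving up.
     */
    static void readWithBoundedRetry(EventStreamReader<String> reader, Duration maxOutage)
            throws InterruptedException {
        Instant lastSuccess = Instant.now();
        long backoffMillis = 100;
        while (true) {
            try {
                EventRead<String> event = reader.readNextEvent(2000);  // 2s read timeout
                if (event.getEvent() != null) {
                    process(event.getEvent());
                }
                lastSuccess = Instant.now();
                backoffMillis = 100;                                   // reset backoff after a successful call
            } catch (ReinitializationRequiredException e) {
                // The reader group was reset; the application must re-create this reader.
                throw new IllegalStateException("Reader must be re-created", e);
            } catch (RuntimeException e) {
                // Treat other failures as transient and retry with backoff until the budget is spent.
                if (Duration.between(lastSuccess, Instant.now()).compareTo(maxOutage) > 0) {
                    throw e;                                           // outage exceeded what the app tolerates
                }
                Thread.sleep(backoffMillis);
                backoffMillis = Math.min(backoffMillis * 2, 10_000);
            }
        }
    }

    private static void process(String event) {
        // application-specific handling of the event payload
    }
}
```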