
Observed reader and SegmentContainer failure

See original GitHub issue

Observed a reader failure, and a SegmentContainer failed with “ERROR i.p.s.s.s.StreamSegmentContainerRegistry - Critical failure for SegmentContainer Container Id = 29, State = FAILED. {} io.pravega.segmentstore.contracts.StreamingException: OperationProcessor stopped unexpectedly (no error) but DurableLog was not currently stopping”, while running IO using Longevity with a moderate workload (total: 4 readers, 3 writers, ~50 events/sec, ~40 KB/s IO).

Also observed 1 reader failure out of the 4 readers during this run:

INFO  [2019-06-20 06:29:21,650] io.pravega.longevity.utils.PerformanceUtils: Readers (3/4): events:475,309,330, events/sec:946, KB/sec:725.89355

Note: In this cluster, Longevity IO had been running fine for ~5d 11h.

Environment details: PKS / K8s with a medium cluster:

3 masters: xlarge: 4 CPU, 16 GB RAM, 32 GB disk
5 workers: 2xlarge: 8 CPU, 32 GB RAM, 64 GB disk
Tier-1 storage is from a vSAN datastore
Tier-2 storage carved out on the NFS Client Provisioner using Isilon as the backend
Pravega version: 0.5.0-2269.6f8a820
Zookeeper Operator: tristan1900/zookeeper:0.2
Pravega Operator: pravega/pravega-operator:0.3.2

Snippet of the error:

2019-06-19 22:43:10,543 487252969 [core-23] WARN  i.p.s.s.i.bookkeeper.BookKeeperLog - Log[29]: Too many rollover failures; closing.
java.util.concurrent.CompletionException: io.pravega.common.util.RetriesExhaustedException: java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:874)
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:690)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: io.pravega.common.util.RetriesExhaustedException: java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
        at io.pravega.common.util.Retry$RetryAndThrowBase.lambda$null$3(Retry.java:214)
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
        ... 12 common frames omitted
Caused by: java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
        at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:708)
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
        ... 8 common frames omitted
Caused by: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
        at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.persistMetadata(BookKeeperLog.java:802)
        at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.updateMetadata(BookKeeperLog.java:756)
        at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.rollover(BookKeeperLog.java:856)
        at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
        ... 9 common frames omitted
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /pravega/pravega/segmentstore/containers/9/2/29
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2272)
        at org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:291)
        at org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:287)
        at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
        at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
        at org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:284)
        at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:270)
        at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:33)
        at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.persistMetadata(BookKeeperLog.java:794)
        ... 12 common frames omitted
2019-06-19 22:43:10,543 487252969 [core-23] ERROR i.p.s.s.h.handler.AppendProcessor - Error (Segment = 'longevity/small/1.#epoch.0', Operation = 'append')
java.util.concurrent.CancellationException: BookKeeperLog has been closed.
        at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.lambda$close$1(BookKeeperLog.java:170)
        at java.util.ArrayList.forEach(ArrayList.java:1257)
        at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.close(BookKeeperLog.java:170)
        at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.handleRolloverFailure(BookKeeperLog.java:146)
        at io.pravega.common.function.Callbacks.invokeSafely(Callbacks.java:54)
        at io.pravega.segmentstore.storage.impl.bookkeeper.SequentialAsyncProcessor.lambda$runInternal$0(SequentialAsyncProcessor.java:85)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at io.pravega.common.concurrent.Futures$Loop.handleException(Futures.java:729)
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:690)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2019-06-19 22:43:11,741 487254167 [core-13] ERROR i.p.s.s.s.StreamSegmentContainerRegistry - Critical failure for SegmentContainer Container Id = 29, State = FAILED. {}
io.pravega.segmentstore.contracts.StreamingException: OperationProcessor stopped unexpectedly (no error) but DurableLog was not currently stopping.
        at io.pravega.segmentstore.server.logs.DurableLog.queueStoppedHandler(DurableLog.java:405)
        at io.pravega.common.concurrent.Services$ShutdownListener.terminated(Services.java:120)
        at com.google.common.util.concurrent.AbstractService$3.call(AbstractService.java:95)
        at com.google.common.util.concurrent.AbstractService$3.call(AbstractService.java:92)
        at com.google.common.util.concurrent.ListenerCallQueue$PerListenerQueue.run(ListenerCallQueue.java:205)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2019-06-19 22:44:26,710 487329136 [core-16] INFO  i.p.s.s.h.ZKSegmentContainerMonitor - Container Changes: Desired = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], Current = [21, 22, 23, 24, 25, 26, 27, 28, 30, 31], PendingTasks = [29], ToStart = [], ToStop = [].
2019-06-19 22:44:35,551 487337977 [epollEventLoopGroup-11-7] ERROR i.p.s.s.h.h.ServerConnectionInboundHandler - Caught exception on connection:
io.pravega.segmentstore.server.IllegalContainerStateException: Container 29 is in an invalid state for this operation. Expected: RUNNING; Actual: STARTING.
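
To make the failure chain above easier to follow: BookKeeperLog.persistMetadata() updates the log's metadata node in ZooKeeper with a conditional (versioned) write via Curator. A BadVersionException means some other process modified that node since this instance last read it, which BookKeeperLog treats as having been fenced out (DataLogWriterNotPrimaryException); after too many such rollover failures the log is closed and the container fails. Below is a minimal, illustrative sketch of that conditional-write pattern, not Pravega's actual code; the connection string and error handling are placeholders.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.data.Stat;

public class VersionedMetadataUpdate {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; Pravega points this at its own ZooKeeper ensemble.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zookeeper-client:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        try {
            String path = "/pravega/pravega/segmentstore/containers/9/2/29";
            Stat stat = new Stat();
            byte[] metadata = zk.getData().storingStatIn(stat).forPath(path); // read value + version

            byte[] updated = metadata; // in reality: re-serialized log metadata after rollover
            // Conditional write: only succeeds if nobody changed the node since we read it.
            zk.setData().withVersion(stat.getVersion()).forPath(path, updated);
        } catch (KeeperException.BadVersionException e) {
            // Another segment store instance has updated the node, i.e. it now owns the log.
            // This is what BookKeeperLog surfaces as DataLogWriterNotPrimaryException above.
            throw new IllegalStateException("No longer the primary writer for this log", e);
        } finally {
            zk.close();
        }
    }
}

Why a competing writer appeared for container 29's log in the first place is the open question here; the ZKSegmentContainerMonitor and IllegalContainerStateException lines above show the container subsequently being restarted.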

PS: the complete log is ~30 MB; I will share it on Slack separately.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

1 reaction
vedanthh commented, Jun 27, 2019

@RaulGracia I have posted Isilon endpoint details in internal channel.

0 reactions
fpj commented, Jul 1, 2019

The behavior described here looks correct. The reader gets an exception after waiting for the recovery of a segment container to complete, and that particular recovery took a while. The exception does not indicate an unrecoverable error, but rather the inability to get a response within a bounded amount of time. This will happen occasionally in production use, and the application needs to be able to deal with it. How to deal with it is application-specific.

It is possible that we need to improve recovery so as to shorten recovery time, but it does not strike me as a P0. It is also possible that recent commits fix this issue.

I’m dropping the priority and moving it to 0.6.
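
For readers of this thread wondering what “deal with it” can look like in application code, below is a minimal, illustrative sketch (not the longevity test's code): it wraps EventStreamReader.readNextEvent() in a capped exponential-backoff retry and only gives up after a bounded number of attempts. The exact exception thrown while a segment container recovers depends on the Pravega client version, so treating any RuntimeException from the read as potentially transient is an assumption here.

import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;

public final class ResilientRead {
    // Illustrative only; exception handling is simplified (any RuntimeException is
    // treated as potentially transient, which is an assumption, not Pravega's contract).
    public static <T> EventRead<T> readWithRetry(EventStreamReader<T> reader,
                                                 long readTimeoutMillis,
                                                 int maxAttempts) throws Exception {
        long backoffMillis = 1_000;
        for (int attempt = 1; ; attempt++) {
            try {
                // Blocks for up to readTimeoutMillis; the returned EventRead may carry a
                // null event if nothing arrived within the timeout.
                return reader.readNextEvent(readTimeoutMillis);
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and let the application decide what to do
                }
                Thread.sleep(backoffMillis);
                backoffMillis = Math.min(backoffMillis * 2, 30_000); // capped exponential backoff
            }
        }
    }
}

Whether to retry, skip, or fail fast remains the application's call; the point is only that a slow container recovery should be handled as a transient condition rather than a fatal one.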

Read more comments on GitHub >
