Observed reader and SegmentContainer failure
Observed a reader failure, and a SegmentContainer failed with “ERROR i.p.s.s.s.StreamSegmentContainerRegistry - Critical failure for SegmentContainer Container Id = 29, State = FAILED. {} io.pravega.segmentstore.contracts.StreamingException: OperationProcessor stopped unexpectedly (no error) but DurableLog was not currently stopping”, while running IO using Longevity with a moderate workload (total: 4 readers, 3 writers, ~50 events/sec, ~40 KB/s IO).
One of the 4 readers also failed during this run:
INFO [2019-06-20 06:29:21,650] io.pravega.longevity.utils.PerformanceUtils: Readers (3/4): events:475,309,330, events/sec:946, KB/sec:725.89355
Note: In this cluster, Longevity IO had been running fine for ~5d 11h before this failure.
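Longevity is an internal test harness, so the exact workload code isn't shown here; as a rough point of reference, the writer side of a comparable workload using the public Pravega Java client might look like the sketch below. The scope/stream names are taken from the segment name that appears later in the log; the controller URI, scaling policy, event size and pacing are assumptions chosen to approximate ~50 events/sec at ~40 KB/s, and the client API may differ slightly between Pravega versions.

```java
import java.net.URI;
import java.util.concurrent.CompletableFuture;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class LongevityLikeWriter {
    public static void main(String[] args) throws Exception {
        // Assumed in-cluster controller endpoint; adjust to the actual service name.
        URI controller = URI.create("tcp://pravega-pravega-controller:9090");
        String scope = "longevity", stream = "small";   // matches 'longevity/small/...' in the log below

        try (StreamManager streamManager = StreamManager.create(controller)) {
            streamManager.createScope(scope);
            streamManager.createStream(scope, stream,
                    StreamConfiguration.builder().scalingPolicy(ScalingPolicy.fixed(3)).build());
        }

        ClientConfig clientConfig = ClientConfig.builder().controllerURI(controller).build();
        try (EventStreamClientFactory factory = EventStreamClientFactory.withScope(scope, clientConfig);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     stream, new UTF8StringSerializer(), EventWriterConfig.builder().build())) {

            String payload = new String(new char[800]).replace('\0', 'x');  // ~800 B per event
            while (true) {                                                  // run until killed, as in a longevity test
                CompletableFuture<Void> ack = writer.writeEvent("key-" + System.nanoTime(), payload);
                ack.join();          // surface append failures (e.g. a failed container) immediately
                Thread.sleep(20);    // crude pacing for the sketch, roughly 50 events/sec
            }
        }
    }
}
```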
Environment details: PKS / K8s, medium cluster:
3 masters (xlarge): 4 CPU, 16 GB RAM, 32 GB disk
5 workers (2xlarge): 8 CPU, 32 GB RAM, 64 GB disk
Tier-1 storage is from a vSAN datastore
Tier-2 storage is carved out via the NFS client provisioner, with Isilon as the backend
Pravega version: 0.5.0-2269.6f8a820
Zookeeper Operator: tristan1900/zookeeper:0.2
Pravega Operator: pravega/pravega-operator:0.3.2
Snippet of the error:
2019-06-19 22:43:10,543 487252969 [core-23] WARN i.p.s.s.i.bookkeeper.BookKeeperLog - Log[29]: Too many rollover failures; closing.
java.util.concurrent.CompletionException: io.pravega.common.util.RetriesExhaustedException: java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:874)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:690)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.pravega.common.util.RetriesExhaustedException: java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
at io.pravega.common.util.Retry$RetryAndThrowBase.lambda$null$3(Retry.java:214)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
... 12 common frames omitted
Caused by: java.util.concurrent.CompletionException: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:708)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
... 8 common frames omitted
Caused by: io.pravega.segmentstore.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive write lock for log (path = 'pravega/pravega/segmentstore/containers/9/2/29').
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.persistMetadata(BookKeeperLog.java:802)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.updateMetadata(BookKeeperLog.java:756)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.rollover(BookKeeperLog.java:856)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
... 9 common frames omitted
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /pravega/pravega/segmentstore/containers/9/2/29
at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:2272)
at org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:291)
at org.apache.curator.framework.imps.SetDataBuilderImpl$4.call(SetDataBuilderImpl.java:287)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
at org.apache.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:284)
at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:270)
at org.apache.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:33)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.persistMetadata(BookKeeperLog.java:794)
... 12 common frames omitted
2019-06-19 22:43:10,543 487252969 [core-23] ERROR i.p.s.s.h.handler.AppendProcessor - Error (Segment = 'longevity/small/1.#epoch.0', Operation = 'append')
java.util.concurrent.CancellationException: BookKeeperLog has been closed.
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.lambda$close$1(BookKeeperLog.java:170)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.close(BookKeeperLog.java:170)
at io.pravega.segmentstore.storage.impl.bookkeeper.BookKeeperLog.handleRolloverFailure(BookKeeperLog.java:146)
at io.pravega.common.function.Callbacks.invokeSafely(Callbacks.java:54)
at io.pravega.segmentstore.storage.impl.bookkeeper.SequentialAsyncProcessor.lambda$runInternal$0(SequentialAsyncProcessor.java:85)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at io.pravega.common.concurrent.Futures$Loop.handleException(Futures.java:729)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:690)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-06-19 22:43:11,741 487254167 [core-13] ERROR i.p.s.s.s.StreamSegmentContainerRegistry - Critical failure for SegmentContainer Container Id = 29, State = FAILED. {}
io.pravega.segmentstore.contracts.StreamingException: OperationProcessor stopped unexpectedly (no error) but DurableLog was not currently stopping.
at io.pravega.segmentstore.server.logs.DurableLog.queueStoppedHandler(DurableLog.java:405)
at io.pravega.common.concurrent.Services$ShutdownListener.terminated(Services.java:120)
at com.google.common.util.concurrent.AbstractService$3.call(AbstractService.java:95)
at com.google.common.util.concurrent.AbstractService$3.call(AbstractService.java:92)
at com.google.common.util.concurrent.ListenerCallQueue$PerListenerQueue.run(ListenerCallQueue.java:205)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-06-19 22:44:26,710 487329136 [core-16] INFO i.p.s.s.h.ZKSegmentContainerMonitor - Container Changes: Desired = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], Current = [21, 22, 23, 24, 25, 26, 27, 28, 30, 31], PendingTasks = [29], ToStart = [], ToStop = [].
2019-06-19 22:44:35,551 487337977 [epollEventLoopGroup-11-7] ERROR i.p.s.s.h.h.ServerConnectionInboundHandler - Caught exception on connection:
io.pravega.segmentstore.server.IllegalContainerStateException: Container 29 is in an invalid state for this operation. Expected: RUNNING; Actual: STARTING.
PS: the complete log (~30 MB) will be shared separately on Slack.
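For context on the failure chain in the snippet: the trace shows BookKeeperLog.persistMetadata performing a versioned setData on the log's metadata znode through Curator. The BadVersionException means the znode was modified by someone else since it was last read, which BookKeeperLog surfaces as DataLogWriterNotPrimaryException, i.e. this instance no longer holds exclusive ownership of log 29. The following is only an illustrative sketch of that compare-and-set pattern (the connection string and metadata handling are simplified placeholders, not Pravega's actual code):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.data.Stat;

public class VersionedZkUpdate {
    public static void main(String[] args) throws Exception {
        // Illustrative ZooKeeper connection string.
        try (CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zookeeper-client:2181", new ExponentialBackoffRetry(1000, 3))) {
            zk.start();

            // Path taken from the stack trace above.
            String path = "/pravega/pravega/segmentstore/containers/9/2/29";
            Stat stat = new Stat();
            byte[] current = zk.getData().storingStatIn(stat).forPath(path);  // read data + znode version

            byte[] updated = mutate(current);  // placeholder for the metadata change

            try {
                // Conditional update: only succeeds if the znode version is unchanged since the read.
                zk.setData().withVersion(stat.getVersion()).forPath(path, updated);
            } catch (KeeperException.BadVersionException e) {
                // Someone else updated the znode first; in Pravega's case this means another
                // segment store instance took over the log, hence DataLogWriterNotPrimaryException.
                throw new IllegalStateException("Lost exclusive ownership of " + path, e);
            }
        }
    }

    private static byte[] mutate(byte[] data) {
        return data;  // no-op for the sketch
    }
}
```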
Top GitHub Comments
@RaulGracia I have posted the Isilon endpoint details in the internal channel.
The behavior described here looks correct. The reader gets an exception after waiting for the recovery of a segment container to complete, and that particular recovery took a while. The exception does not indicate an unrecoverable error, but rather the inability to get a response within a bounded amount of time. This will happen occasionally in production use, and the application needs to be able to deal with it. How to deal with it is application-specific.
It is possible that we need to improve the recovery so that we shorten recovery time, but it does not strike me as a P0. It is also possible that recent commits fix this issue. I'm dropping the priority and moving it to 0.6.
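As the comment notes, how to deal with such bounded-wait failures is application-specific. One possible approach on the reader side, purely as an illustrative sketch against the Pravega Java EventStreamReader API (the retry budget, backoff policy and treatment of runtime exceptions as transient are assumptions, not a recommendation from the maintainers), is shown below:

```java
import java.time.Duration;
import java.time.Instant;

import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReinitializationRequiredException;

public class ResilientReadLoop {

    /**
     * Reads events, tolerating transient read failures (for example while a segment
     * container is recovering) for up to {@code maxOutage} before giving up.
     */
    static void readWithBoundedRetry(EventStreamReader<String> reader, Duration maxOutage)
            throws InterruptedException {
        Instant lastSuccess = Instant.now();
        long backoffMillis = 100;
        while (true) {
            try {
                EventRead<String> event = reader.readNextEvent(2000);  // 2s read timeout
                if (event.getEvent() != null) {
                    process(event.getEvent());
                }
                lastSuccess = Instant.now();
                backoffMillis = 100;                                   // reset backoff after a successful call
            } catch (ReinitializationRequiredException e) {
                // The reader group was reset; the application must re-create this reader.
                throw new IllegalStateException("Reader must be re-created", e);
            } catch (RuntimeException e) {
                // Treat other failures as transient and retry with backoff until the budget is spent.
                if (Duration.between(lastSuccess, Instant.now()).compareTo(maxOutage) > 0) {
                    throw e;                                           // outage exceeded what the app tolerates
                }
                Thread.sleep(backoffMillis);
                backoffMillis = Math.min(backoffMillis * 2, 10_000);
            }
        }
    }

    private static void process(String event) {
        // application-specific handling of the event payload
    }
}
```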