Create ledger timeout and BookKeeperClientWorker thread stuck
Describe the bug
We found many timeout logs when creating a producer or sending messages. For example, we create a new topic and then build a producer to send a message, and see the following:
java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout
at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
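For context, the client-side operation that hits this timeout is roughly the following (a minimal sketch; the service URL and topic name are placeholders, not our real configuration):

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class ProducerExample {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build()) {
            // Producer creation is where "Failed to load topic within timeout" surfaces,
            // because the broker cannot finish opening the managed ledger for the new topic.
            Producer<String> producer = client.newProducer(Schema.STRING)
                    .topic("persistent://public/default/new-topic") // placeholder topic
                    .create();
            producer.send("hello");
            producer.close();
        }
    }
}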
We checked the stack trace of the create-ledger path and found that when ledger creation completes, the callback cannot run because the BookKeeperClientWorker thread is unavailable.
We dumped the thread stacks: BookKeeperClientWorker-OrderedExecutor-0-0 is always BLOCKED.
"BookKeeperClientScheduler-OrderedScheduler-0-0" #26 prio=5 os_prio=0 cpu=61864.97ms elapsed=268089.10s tid=0x00007f88598ef800 nid=0x17b in Object.wait() [0x00007f87fa5e1000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@11.0.12/Native Method)
- waiting on <no object reference available>
at java.lang.Object.wait(java.base@11.0.12/Unknown Source)
at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:275)
- waiting to re-lock in wait() <0x00000000cb13f9c0> (a io.netty.channel.DefaultChannelPromise)
at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:137)
at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:30)
at org.apache.bookkeeper.proto.PerChannelBookieClient.closeInternal(PerChannelBookieClient.java:1081)
at org.apache.bookkeeper.proto.PerChannelBookieClient.disconnect(PerChannelBookieClient.java:1034)
at org.apache.bookkeeper.proto.PerChannelBookieClient.disconnect(PerChannelBookieClient.java:1029)
at org.apache.bookkeeper.proto.PerChannelBookieClient.failTLS(PerChannelBookieClient.java:2555)
- locked <0x00000000b8293218> (a org.apache.bookkeeper.proto.PerChannelBookieClient)
at org.apache.bookkeeper.proto.PerChannelBookieClient.access$2900(PerChannelBookieClient.java:154)
at org.apache.bookkeeper.proto.PerChannelBookieClient$StartTLSCompletion.errorOut(PerChannelBookieClient.java:1961)
at org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue.timeout(PerChannelBookieClient.java:1614)
at org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue.maybeTimeout(PerChannelBookieClient.java:1606)
at org.apache.bookkeeper.proto.PerChannelBookieClient.lambda$static$3(PerChannelBookieClient.java:1011)
at org.apache.bookkeeper.proto.PerChannelBookieClient$$Lambda$596/0x00000001006f8440.test(Unknown Source)
at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap$Section.removeIf(ConcurrentOpenHashMap.java:411)
at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap.removeIf(ConcurrentOpenHashMap.java:172)
at org.apache.bookkeeper.proto.PerChannelBookieClient.checkTimeoutOnPendingOperations(PerChannelBookieClient.java:1015)
at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.checkTimeoutOnPendingOperations(DefaultPerChannelBookieClientPool.java:132)
at org.apache.bookkeeper.proto.BookieClientImpl.monitorPendingOperations(BookieClientImpl.java:572)
at org.apache.bookkeeper.proto.BookieClientImpl.lambda$new$0(BookieClientImpl.java:131)
at org.apache.bookkeeper.proto.BookieClientImpl$$Lambda$163/0x000000010038e840.run(Unknown Source)
at org.apache.bookkeeper.util.SafeRunnable$1.safeRun(SafeRunnable.java:43)
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:705)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.12/Unknown Source)
at java.util.concurrent.FutureTask.runAndReset(java.base@11.0.12/Unknown Source)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.12/Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.12/Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.12/Unknown Source)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(java.base@11.0.12/Unknown Source)
"BookKeeperClientWorker-OrderedExecutor-0-0" #27 prio=5 os_prio=0 cpu=616668.69ms elapsed=268089.10s tid=0x00007f88598f3000 nid=0x17c waiting for monitor entry [0x00007f87fa4e0000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.bookkeeper.proto.PerChannelBookieClient.connectIfNeededAndDoOp(PerChannelBookieClient.java:631)
- waiting to lock <0x00000000b8293218> (a org.apache.bookkeeper.proto.PerChannelBookieClient)
at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.obtain(DefaultPerChannelBookieClientPool.java:121)
at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.obtain(DefaultPerChannelBookieClientPool.java:116)
at org.apache.bookkeeper.proto.BookieClientImpl.addEntry(BookieClientImpl.java:329)
at org.apache.bookkeeper.client.PendingAddOp.sendWriteRequest(PendingAddOp.java:152)
at org.apache.bookkeeper.client.PendingAddOp.safeRun(PendingAddOp.java:278)
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.12/Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.12/Unknown Source)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(java.base@11.0.12/Unknown Source)
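Reading the two dumps together: the scheduler thread takes the PerChannelBookieClient monitor in the synchronized failTLS() and then blocks inside closeInternal() waiting on a Netty channel promise, while the worker thread is BLOCKED trying to enter the same monitor in connectIfNeededAndDoOp(), so the add-entry and create-ledger callbacks never run. A simplified sketch of that pattern follows (class and method names are illustrative, not the actual BookKeeper code):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: thread A waits for an external event while holding the
// object monitor; thread B needs the same monitor to make any progress.
public class MonitorHeldWhileWaiting {
    // Never counted down, standing in for the channel-close promise that never completes.
    private final CountDownLatch channelClosed = new CountDownLatch(1);

    // Analogue of failTLS() -> disconnect() -> closeInternal().awaitUninterruptibly():
    // 'this' is still held while we block.
    public synchronized void failAndDisconnect() throws InterruptedException {
        channelClosed.await(10, TimeUnit.MINUTES);
    }

    // Analogue of connectIfNeededAndDoOp(): BLOCKED on the monitor held above.
    public synchronized void addEntry(Runnable writeRequest) {
        writeRequest.run();
    }

    public static void main(String[] args) {
        MonitorHeldWhileWaiting client = new MonitorHeldWhileWaiting();
        new Thread(() -> {
            try { client.failAndDisconnect(); } catch (InterruptedException ignored) { }
        }, "scheduler-like").start();
        new Thread(() -> client.addEntry(() -> System.out.println("write sent")),
                "worker-like").start();
    }
}

A thread dump of this sketch shows the same shape as the dumps above: one thread waiting while holding the monitor, the other BLOCKED trying to acquire it.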
To Reproduce
I don't know how to reproduce this.
Additional context
Version: 2.9.1
Top GitHub Comments
Since upgrading from 2.7.1 to 2.9.1 we have also been hitting issues where our producers begin failing. Typically we'll have one broker that isn't responding to requests, and restarting it fixes the issue. From the broker logs we see a similar error to what is described in this issue:
2022-02-18T17:35:06,772+0000 [pulsar-io-4-7] WARN org.apache.pulsar.broker.service.ServerCnx - [/10.100.209.43:51632][persistent://prod/voltron-general/871_2b2d84be150dcf9c_MAID_DELETE_6333758_4bb66664126194f7-partition-0][voltron] Failed to create consumer: consumerId=0, Failed to load topic within timeout java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout ... at org.apache.pulsar.common.util.FutureUtil.lambda$addTimeoutHandling$1(FutureUtil.java:141) ~[org.apache.pulsar-pulsar-common-2.9.1.jar:2.9.1] at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final] ... Caused by: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout
This seems to be happening randomly every 4-7 hours since we upgraded. We typically write to a lot of topics in a given namespace. We’ll try to capture thread state next time it happens.
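Not from the original comment, but one way to capture thread state when a broker gets into this condition is a standard JVM thread dump (for example jstack against the broker process); the sketch below does the equivalent in-process through the java.lang.management API:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Prints every thread's state, held monitors and stack frames,
// roughly what an external jstack would show for the stuck BookKeeper threads.
public class ThreadDumper {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info); // note: ThreadInfo.toString() truncates deep stacks
        }
    }
}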
This issue had no activity for 30 days and was marked with the Stale label.