[Bug] zk EventThread gets blocked after zk leader stops
Search before asking
- I searched in the issues and found nothing similar.
Version
# pulsar version: 2.8.1
# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
# java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
# uname -r
3.10.0-123.el7.x86_64
Minimal reproduce step
Set up a Pulsar cluster and a zk cluster, then stop the zk leader. The problem reproduces only some of the time, not on every attempt.
What did you expect to see?
The zk client should reconnect to another zk node properly.
What did you see instead?
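main-EventThread is blocked in the org.apache.pulsar.metadata.impl.ZKSessionWatcher#process method, waiting for the ZKSessionWatcher monitor that metadata-store-zk-session-watcher-7-1 holds inside checkConnectionStatus: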
"metadata-store-zk-session-watcher-7-1" #16 prio=5 os_prio=0 tid=0x00007fd7ca112000 nid=0x6991 waiting on condition [0x00007fd742afa000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000007953debd8> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
at org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:104)
- locked <0x00000006c66130e0> (a org.apache.pulsar.metadata.impl.ZKSessionWatcher)
at org.apache.pulsar.metadata.impl.ZKSessionWatcher$$Lambda$38/909132503.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
"main-EventThread" #15 daemon prio=5 os_prio=0 tid=0x00007fd7ca108800 nid=0x6990 waiting for monitor entry [0x00007fd768380000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.pulsar.metadata.impl.ZKSessionWatcher.process(ZKSessionWatcher.java:120)
- waiting to lock <0x00000006c66130e0> (a org.apache.pulsar.metadata.impl.ZKSessionWatcher)
at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$new$0(ZKMetadataStore.java:75)
at org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$35/1305486145.process(Unknown Source)
at org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.notifyEvent(ZooKeeperWatcherBase.java:180)
at org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.process(ZooKeeperWatcherBase.java:146)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:588)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:563)
main-EventThread log: (not captured in this copy)
metadata-store-zk-session-watcher-7-1 log: (not captured in this copy)
- org.apache.pulsar.metadata.impl.ZKSessionWatcher#currentStatus is SessionLost
- The zk EventThread internal task queue waitingEvents has many events to process
- The keeperState of the event that the zk EventThread is currently processing is SyncConnected
It seems that the zk exists operation always times out because the zk EventThread is blocked, while the metadata-store-zk-session-watcher thread always manages to acquire the lock first (I don’t understand why; a JVM bug?).
https://github.com/apache/pulsar/blob/0866c3a6a734b39402a6bc8349bab13edab00488/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/ZKSessionWatcher.java#L68-L71
https://github.com/apache/pulsar/blob/0866c3a6a734b39402a6bc8349bab13edab00488/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/ZKSessionWatcher.java#L86-L108
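The interaction visible in the two stack traces can be reproduced in isolation. The sketch below is not Pulsar code: `MonitorStallSketch` and its single-thread executor are hypothetical stand-ins for ZKSessionWatcher and the zk EventThread, and it assumes the exists callback is delivered on the same thread that delivers watcher events. Under that assumption, the probe cannot complete while the checking thread holds the monitor, because the event queued ahead of it is stuck in `process`.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for ZKSessionWatcher; names are illustrative only.
class MonitorStallSketch {
    // Stand-in for the zk EventThread: one thread delivers events and callbacks in order.
    private final ExecutorService eventThread = Executors.newSingleThreadExecutor();

    // Stand-in for process(): needs the same monitor as checkConnectionStatus().
    synchronized void process(String keeperState) {
        // A real watcher would update its session state here.
    }

    // Stand-in for checkConnectionStatus(): holds the monitor while waiting on the probe.
    synchronized boolean checkConnectionStatus(long timeoutMs) {
        CompletableFuture<Boolean> probe = new CompletableFuture<>();
        // An event is already queued; delivering it requires the monitor held by this thread.
        eventThread.execute(() -> process("SyncConnected"));
        // The probe's callback is queued behind that event on the same thread.
        eventThread.execute(() -> probe.complete(true));
        try {
            return probe.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return false; // the callback could not run while the monitor was held
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    void shutdown() {
        eventThread.shutdownNow();
    }

    public static void main(String[] args) {
        MonitorStallSketch sketch = new MonitorStallSketch();
        System.out.println("probe succeeded: " + sketch.checkConnectionStatus(200));
        sketch.shutdown();
    }
}
```

In this sketch the probe times out on every call, regardless of how long the timeout is. As for why the checking thread keeps winning the lock: Java intrinsic monitors make no fairness guarantee, so a thread that releases a monitor and re-enters it shortly afterwards can acquire it ahead of a thread that was already waiting. That may be enough to explain the behavior without a JVM bug, though this has not been verified against the broker in question.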
Anything else?
No response
Are you willing to submit a PR?
- I’m willing to submit a PR!
Top GitHub Comments
I’ve talked with @codelipenghui; I’ll pick that PR into our release branch in my org.
Thanks again for pointing this out, @codelipenghui @tisonkun. I’m going to close this one.
@Shawyeok The issue should be fixed by #17909
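For readers who hit the same pattern elsewhere, one general way to avoid this kind of stall is to wait for the probe without holding the monitor, and to take the lock only for the short state update afterwards. The sketch below illustrates that idea only; it is not taken from #17909 and does not claim to describe what that PR changes. `NonBlockingCheckSketch`, its status strings, and the executor standing in for the zk EventThread are all hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not the actual ZKSessionWatcher implementation.
class NonBlockingCheckSketch {
    // Stand-in for the zk EventThread.
    private final ExecutorService eventThread = Executors.newSingleThreadExecutor();
    private String status = "Connected";

    // Event delivery still takes the monitor, but only briefly.
    synchronized void process(String keeperState) {
        status = keeperState;
    }

    synchronized String currentStatus() {
        return status;
    }

    // Not synchronized: the wait happens without holding the monitor.
    boolean checkConnectionStatus(long timeoutMs) {
        CompletableFuture<Boolean> probe = new CompletableFuture<>();
        eventThread.execute(() -> process("SyncConnected"));
        eventThread.execute(() -> probe.complete(true));
        boolean ok;
        try {
            ok = probe.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            ok = false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            ok = false;
        }
        // Take the lock only for the state update.
        synchronized (this) {
            if (!ok) {
                status = "ConnectionLost";
            }
        }
        return ok;
    }

    void shutdown() {
        eventThread.shutdownNow();
    }

    public static void main(String[] args) {
        NonBlockingCheckSketch sketch = new NonBlockingCheckSketch();
        System.out.println("probe succeeded: " + sketch.checkConnectionStatus(2000));
        System.out.println("status: " + sketch.currentStatus());
        sketch.shutdown();
    }
}
```

Because the event thread is never blocked behind the waiting thread, the queued event is delivered first and the probe completes well within the timeout. The trade-off is that the state can change between the probe result and the synchronized update, so a real implementation would need to decide how to reconcile the two.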