[Bug] zk EventThread get block after zk leader stop

Search before asking

  • I searched in the issues and found nothing similar.

Version

# pulsar version: 2.8.1

# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
# java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
# uname  -r
3.10.0-123.el7.x86_64

Minimal reproduce step

Set up a Pulsar cluster and a zk cluster, then stop the zk leader. The problem reproduces some of the time, not always.

What did you expect to see?

The zk client should reconnect to another zk node properly.

What did you see instead?

main-EventThread is blocked in the org.apache.pulsar.metadata.impl.ZKSessionWatcher#process method.

"metadata-store-zk-session-watcher-7-1" #16 prio=5 os_prio=0 tid=0x00007fd7ca112000 nid=0x6991 waiting on condition [0x00007fd742afa000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007953debd8> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
        at org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:104)
        - locked <0x00000006c66130e0> (a org.apache.pulsar.metadata.impl.ZKSessionWatcher)
        at org.apache.pulsar.metadata.impl.ZKSessionWatcher$$Lambda$38/909132503.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

"main-EventThread" #15 daemon prio=5 os_prio=0 tid=0x00007fd7ca108800 nid=0x6990 waiting for monitor entry [0x00007fd768380000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.pulsar.metadata.impl.ZKSessionWatcher.process(ZKSessionWatcher.java:120)
        - waiting to lock <0x00000006c66130e0> (a org.apache.pulsar.metadata.impl.ZKSessionWatcher)
        at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$new$0(ZKMetadataStore.java:75)
        at org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$35/1305486145.process(Unknown Source)
        at org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.notifyEvent(ZooKeeperWatcherBase.java:180)
        at org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.process(ZooKeeperWatcherBase.java:146)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:588)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:563)

main-EventThread log: [image]

metadata-store-zk-session-watcher-7-1 log: [image]

org.apache.pulsar.metadata.impl.ZKSessionWatcher#currentStatus is SessionLost. [image]

The zk EventThread's internal task queue, waitingEvents, has many events waiting to be processed. [image]

The keeperState of the event that the zk EventThread is currently processing is SyncConnected. [image]

It seems that the zk exists operation always times out because the zk EventThread is blocked, while the metadata-store-zk-session-watcher thread always manages to acquire the lock (I don’t understand why; a JVM bug?).

https://github.com/apache/pulsar/blob/0866c3a6a734b39402a6bc8349bab13edab00488/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/ZKSessionWatcher.java#L68-L71
https://github.com/apache/pulsar/blob/0866c3a6a734b39402a6bc8349bab13edab00488/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/ZKSessionWatcher.java#L86-L108
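The interaction described above can be sketched in isolation. This is a minimal illustration, not the Pulsar code. It assumes, as the thread dumps suggest, that the periodic check waits on a future while holding the watcher's monitor, and that the async result is delivered on the same single thread that also runs the synchronized watcher callback. The class `MonitorStallSketch` and all of its method names are hypothetical stand-ins.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in; none of these names come from Pulsar or ZooKeeper.
class MonitorStallSketch {
    // One daemon thread delivers both watcher events and async results,
    // playing the role of the zk client's event thread.
    private final ExecutorService eventThread = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "sketch-event-thread");
        t.setDaemon(true);
        return t;
    });

    // Watcher callback: runs on the event thread and needs this object's monitor.
    synchronized void process() {
        // A real watcher would update its session state here.
    }

    // Async "exists" stand-in whose result is delivered on the event thread.
    private CompletableFuture<Boolean> existsAsync() {
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        eventThread.execute(() -> result.complete(true));
        return result;
    }

    // Queue a watcher event, then wait for the async result behind it.
    private boolean probe(long timeoutMs) {
        eventThread.execute(this::process);
        try {
            return existsAsync().get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return false; // the result never arrived in time
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Waits while holding the monitor: the event thread blocks in process(),
    // so the result queued behind it cannot be delivered before the timeout.
    synchronized boolean checkHoldingMonitor(long timeoutMs) {
        return probe(timeoutMs);
    }

    // Waits without the monitor: process() runs, then the result is delivered.
    boolean checkWithoutMonitor(long timeoutMs) {
        return probe(timeoutMs);
    }

    void shutdown() {
        eventThread.shutdownNow();
    }

    public static void main(String[] args) {
        MonitorStallSketch sketch = new MonitorStallSketch();
        try {
            System.out.println("holding monitor: " + sketch.checkHoldingMonitor(200));
            System.out.println("without monitor: " + sketch.checkWithoutMonitor(2000));
        } finally {
            sketch.shutdown();
        }
    }
}
```

In this sketch the check that holds the monitor times out on every call, because the thread that would deliver the result is stuck waiting for that same monitor. As for why the checking thread could keep winning the lock: Java's intrinsic monitors make no fairness guarantee, so a periodic task that re-enters a `synchronized` method may repeatedly acquire the lock ahead of a thread that is already waiting. Whether that is what happened in the reported case is an assumption, not something the dumps prove.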

Anything else?

No response

Are you willing to submit a PR?

  • I’m willing to submit a PR!

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

Shawyeok commented, Oct 31, 2022

I think this bug should also be resolved by #17909.

It seems that the reported version is 2.8.1, while the fix has only been picked to versions after 2.9.x. @codelipenghui do we have a released version including the fix that @Shawyeok can try to use?

I’ve talked with @codelipenghui, and I’ll pick that PR into the release branch in my org.

Thanks again for pointing this out @codelipenghui @tisonkun, I’m going to close this one.

codelipenghui commented, Oct 31, 2022

@Shawyeok The issue should be fixed by #17909
