[Java client] Deadlock in Pulsar Client when running ConsumerBatchReceiveTest

Describe the bug

There is a deadlock in the Pulsar Java client on the master branch. A PR test run stalled, and the thread dump reported the following deadlock:

Found one Java-level deadlock:
=============================
"pulsar-timer-462-1":
  waiting to lock monitor 0x00007fce080ad180 (object 0x00000000c6094a00, a org.apache.pulsar.client.impl.ConsumerImpl),
  which is held by "pulsar-client-internal-459-1"
"pulsar-client-internal-459-1":
  waiting for ownable synchronizer 0x00000000c6094bf0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "pulsar-timer-462-1"

Java stack information for the threads listed above:
===================================================
"pulsar-timer-462-1":
        at org.apache.pulsar.client.impl.ConsumerImpl.redeliverUnacknowledgedMessages(ConsumerImpl.java:1578)
        - waiting to lock <0x00000000c6094a00> (a org.apache.pulsar.client.impl.ConsumerImpl)
        at org.apache.pulsar.client.impl.ConsumerImpl.redeliverUnacknowledgedMessages(ConsumerImpl.java:1619)
        at org.apache.pulsar.client.impl.UnAckedMessageTracker$2.run(UnAckedMessageTracker.java:145)
        at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
        at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
        at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
"pulsar-client-internal-459-1":
        at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
        - parking to wait for  <0x00000000c6094bf0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.11/AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.11/AbstractQueuedSynchronizer.java:917)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.11/AbstractQueuedSynchronizer.java:1240)
        at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@11.0.11/ReentrantReadWriteLock.java:959)
        at org.apache.pulsar.client.impl.UnAckedMessageTracker.add(UnAckedMessageTracker.java:180)
        at org.apache.pulsar.client.impl.ConsumerImpl.trackMessage(ConsumerImpl.java:1385)
        at org.apache.pulsar.client.impl.ConsumerImpl.trackMessage(ConsumerImpl.java:1369)
        at org.apache.pulsar.client.impl.ConsumerImpl.messageProcessed(ConsumerImpl.java:1362)
        - locked <0x00000000c6094a00> (a org.apache.pulsar.client.impl.ConsumerImpl)
        at org.apache.pulsar.client.impl.ConsumerImpl.lambda$internalBatchReceiveAsync$5(ConsumerImpl.java:483)
        at org.apache.pulsar.client.impl.ConsumerImpl$$Lambda$1271/0x0000000100ac0c40.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.11/Executors.java:515)
        at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.11/ScheduledThreadPoolExecutor.java:304)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

Found 1 deadlock.

Full thread dump: https://gist.github.com/lhotari/1bbcc43e850bd7d62891ba7fe3724b0b
Thread dump in the jstack.review UI: https://jstack.review/?https://gist.github.com/lhotari/1bbcc43e850bd7d62891ba7fe3724b0b#tda_1_dump
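
Reading the dump, the cycle looks like a lock-order inversion: the timer thread holds the tracker's ReentrantReadWriteLock and wants the ConsumerImpl monitor, while the client-internal thread holds the ConsumerImpl monitor and wants the write lock in UnAckedMessageTracker.add. The following is a minimal sketch of that pattern using only the JDK. The class, variable, and thread names are hypothetical stand-ins, not Pulsar code, and the latch exists only to force the unlucky interleaving. The cycle is detected with ThreadMXBean.findDeadlockedThreads(), which covers both object monitors and ownable synchronizers.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; none of these names come from Pulsar.
class LockOrderInversionDemo {

    // Starts two daemon threads that take the same two locks in opposite
    // order, then polls the JVM until it reports the resulting cycle.
    static long[] provokeAndDetect() throws InterruptedException {
        // Stand-in for the ConsumerImpl monitor in the dump.
        final Object consumerMonitor = new Object();
        // Stand-in for the tracker's ReentrantReadWriteLock (non-fair by default).
        final ReentrantReadWriteLock trackerLock = new ReentrantReadWriteLock();
        // Released only once each thread holds its first lock.
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread timer = new Thread(() -> {
            trackerLock.writeLock().lock();            // first: tracker lock
            try {
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                synchronized (consumerMonitor) { }     // second: monitor, blocks forever
            } finally {
                trackerLock.writeLock().unlock();
            }
        }, "demo-timer");

        Thread internal = new Thread(() -> {
            synchronized (consumerMonitor) {           // first: monitor
                bothHoldFirstLock.countDown();
                awaitQuietly(bothHoldFirstLock);
                trackerLock.writeLock().lock();        // second: tracker lock, blocks forever
                trackerLock.writeLock().unlock();
            }
        }, "demo-internal");

        // Daemon threads so the JVM can still exit while they are stuck.
        timer.setDaemon(true);
        internal.setDaemon(true);
        timer.start();
        internal.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {
            long[] ids = mx.findDeadlockedThreads();   // null while no cycle exists
            if (ids != null) {
                return ids;
            }
            Thread.sleep(50);
        }
        return new long[0];
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long[] ids = provokeAndDetect();
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            System.out.println("deadlocked: " + info.getThreadName());
        }
    }
}
```

Running main should list the two demo threads as deadlocked, which mirrors the shape of the "Found one Java-level deadlock" section above: one side waits on a monitor, the other on an ownable synchronizer.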

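The test being executed was ConsumerBatchReceiveTest. Its main thread is stuck in cleanup, where PulsarClientImpl.shutdown waits in HashedWheelTimer.stop for the timer thread to finish: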

"main" #1 prio=5 os_prio=0 cpu=13468.61ms elapsed=6534.24s tid=0x00007fce64027800 nid=0xca9 in Object.wait()  [0x00007fce6a11e000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(java.base@11.0.11/Native Method)
        - waiting on <no object reference available>
        at java.lang.Thread.join(java.base@11.0.11/Thread.java:1308)
        - waiting to re-lock in wait() <0x00000000c4246738> (a io.netty.util.concurrent.FastThreadLocalThread)
        at io.netty.util.HashedWheelTimer.stop(HashedWheelTimer.java:383)
        at org.apache.pulsar.client.impl.PulsarClientImpl.shutdown(PulsarClientImpl.java:730)
        at org.apache.pulsar.broker.auth.MockedPulsarServiceBaseTest.internalCleanup(MockedPulsarServiceBaseTest.java:192)
        at org.apache.pulsar.client.api.ConsumerBatchReceiveTest.cleanup(ConsumerBatchReceiveTest.java:48)
        at org.apache.pulsar.tests.TestRetrySupport.stateCheck(TestRetrySupport.java:52)
        at jdk.internal.reflect.GeneratedMethodAccessor129.invoke(Unknown Source)
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.11/DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(java.base@11.0.11/Method.java:566)
        at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
        at org.testng.internal.MethodInvocationHelper.invokeMethodConsideringTimeout(MethodInvocationHelper.java:61)
        at org.testng.internal.ConfigInvoker.invokeConfigurationMethod(ConfigInvoker.java:366)
        at org.testng.internal.ConfigInvoker.invokeConfigurations(ConfigInvoker.java:320)
        at org.testng.internal.TestInvoker.runConfigMethods(TestInvoker.java:701)
        at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:527)
        at org.testng.internal.TestInvoker.retryFailed(TestInvoker.java:214)
        at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:58)
        at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
        at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
        at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
        at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
        at org.testng.TestRunner$$Lambda$219/0x0000000100448c40.accept(Unknown Source)
        at java.util.ArrayList.forEach(java.base@11.0.11/ArrayList.java:1541)
        at org.testng.TestRunner.privateRun(TestRunner.java:764)
        at org.testng.TestRunner.run(TestRunner.java:585)
        at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
        at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
        at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
        at org.testng.SuiteRunner.run(SuiteRunner.java:286)
        at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
        at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
        at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
        at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
        at org.testng.TestNG.runSuites(TestNG.java:1069)
        at org.testng.TestNG.run(TestNG.java:1037)
        at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
        at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
        at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeLazy(TestNGDirectoryTestSuite.java:123)
        at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:90)
        at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
        at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
        at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

Expected behavior

Pulsar Client shouldn’t deadlock.
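
One common way to break this kind of cycle is to avoid calling into another locked object while holding your own lock. The sketch below illustrates the idea under stated assumptions: TrackerSketch, RedeliveryTarget, and the String message IDs are hypothetical, and this is not presented as the change that was actually made in Pulsar. The timeout handler copies the expired IDs while holding the write lock, releases the lock, and only then calls the redelivery callback.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical tracker; names are illustrative and not taken from Pulsar.
class TrackerSketch {

    // Hypothetical callback, standing in for a consumer whose redelivery
    // method takes its own monitor.
    interface RedeliveryTarget {
        void redeliver(List<String> messageIds);
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Set<String> unacked = new HashSet<>();
    private final RedeliveryTarget target;

    TrackerSketch(RedeliveryTarget target) {
        this.target = target;
    }

    // Called from the consumer side, possibly while the consumer monitor is held.
    void add(String messageId) {
        lock.writeLock().lock();
        try {
            unacked.add(messageId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Called from the timer side. The callback runs only after the write
    // lock has been released, so this thread never holds the tracker lock
    // while waiting for the consumer monitor.
    void onTimeout() {
        List<String> expired;
        lock.writeLock().lock();
        try {
            expired = new ArrayList<>(unacked);
            unacked.clear();
        } finally {
            lock.writeLock().unlock();
        }
        if (!expired.isEmpty()) {
            target.redeliver(expired);
        }
    }

    // Exposed only so a caller can check that the callback runs unlocked.
    boolean holdsWriteLock() {
        return lock.isWriteLockedByCurrentThread();
    }
}
```

The trade-off is that the callback sees a snapshot: IDs added after the copy are handled by the next timeout. An alternative with the same goal is to make every code path acquire the two locks in the same order.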

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

eolivelli commented, Jun 1, 2021

@lhotari I see you marked this as “release/blocker”.

I would consider this a “blocker for a release” only if it is a regression compared to the latest version. The rule of thumb is that a new release must not be worse than the previous one. It is usually acceptable to cut a release with a known serious problem, as long as that problem is not new: the release still contains lots of improvements and fixes, so there is no need to “block” it for a newly found “ugly problem”.

Do you think this is a regression?

(Please note that my comment is more general and not directly about this issue. I am fine with fixing it for 2.8, but if it is not a regression then I would not consider it a blocker.)

eolivelli commented, Jun 1, 2021

I am also investigating a simpler case:

  • send 1,000,000 messages to a 100-partition topic (simple Pulsar cluster with 3 bookies, 1 broker, 1 proxy, 3x3x3 replication)
  • consume those 1,000,000 messages (shared subscription)

The consumer gets stuck after receiving 80% of the messages. I will open a separate ticket.
