Broker suddenly goes down

Recently, brokers in some of our clusters have been going down occasionally. The following is an excerpt from the log of a broker that went down.

13:30:11.464 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 25 seconds
13:30:13.464 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 23 seconds
13:30:15.464 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 21 seconds
13:30:17.464 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 19 seconds
13:30:19.465 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 17 seconds
13:30:21.465 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 15 seconds
13:30:23.465 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 13 seconds
13:30:25.465 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 11 seconds
13:30:27.465 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 8 seconds
13:30:29.465 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 6 seconds
13:30:31.465 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 4 seconds
13:30:33.466 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 2 seconds
13:30:35.466 [pulsar-zk-session-watcher-12-1] WARN  o.a.p.z.ZooKeeperSessionWatcher      - zoo keeper disconnected, waiting to reconnect, time remaining = 0 seconds
13:30:37.466 [pulsar-zk-session-watcher-12-1] ERROR o.a.p.z.ZooKeeperSessionWatcher      - timeout expired for reconnecting, invoking shutdown service
13:30:37.467 [pulsar-zk-session-watcher-12-1] INFO  org.apache.zookeeper.ZooKeeper       - Session: 0x164f333639f0269 closed
13:30:37.467 [pulsar-zk-session-watcher-12-1] INFO  o.a.p.b.MessagingServiceShutdownHook - Invoking Runtime.halt(-1)

The broker service was shut down because it could not reconnect to ZK within the timeout. However, all ZK servers seemed to be working normally at that time.

Does anyone know the cause?

System configuration

  • Cluster-A

    • Pulsar version: 1.22.1-incubating
    • ZK version: 3.4.10
  • Cluster-B

    • Pulsar version: 2.0.1-incubating
    • ZK version: 3.4.10

ZK settings:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/usr/local/var/pulsar-zookeeper
clientPort=2181
maxClientCnxns=0
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
server.1=xxxx:2182:2183
server.2=xxxx:2182:2183
server.3=xxxx:2182:2183
server.4=xxxx:2182:2183
server.5=xxxx:2182:2183
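
For reference, the timing values implied by these settings can be worked out as below. This is only a sketch: it relies on ZooKeeper's documented defaults (minSessionTimeout = 2 x tickTime, maxSessionTimeout = 20 x tickTime, which apply here because neither is overridden in the settings above). The class name and the 30000 ms requested timeout are assumptions for illustration; the broker's actual session timeout is not stated in this issue.

```java
public class ZkTimingSketch {
    // Values copied from the ZK settings above.
    static final int TICK_TIME_MS = 2000;
    static final int INIT_LIMIT_TICKS = 10;
    static final int SYNC_LIMIT_TICKS = 5;

    // ZooKeeper defaults when minSessionTimeout/maxSessionTimeout are not set.
    static int minSessionTimeoutMs(int tickTimeMs) {
        return 2 * tickTimeMs;
    }

    static int maxSessionTimeoutMs(int tickTimeMs) {
        return 20 * tickTimeMs;
    }

    // The server clamps the timeout requested by the client into [min, max].
    static int negotiatedSessionTimeoutMs(int requestedMs, int tickTimeMs) {
        int lower = minSessionTimeoutMs(tickTimeMs);
        int upper = maxSessionTimeoutMs(tickTimeMs);
        return Math.max(lower, Math.min(requestedMs, upper));
    }

    public static void main(String[] args) {
        // ASSUMPTION: 30000 ms is a hypothetical client request, not taken from the issue.
        int requestedMs = 30000;
        System.out.println("min session timeout (ms): " + minSessionTimeoutMs(TICK_TIME_MS));
        System.out.println("max session timeout (ms): " + maxSessionTimeoutMs(TICK_TIME_MS));
        System.out.println("negotiated for request (ms): "
                + negotiatedSessionTimeoutMs(requestedMs, TICK_TIME_MS));
        System.out.println("follower initial sync limit (ms): " + INIT_LIMIT_TICKS * TICK_TIME_MS);
        System.out.println("follower sync limit (ms): " + SYNC_LIMIT_TICKS * TICK_TIME_MS);
    }
}
```

With tickTime=2000, the allowed session timeout range is 4 to 40 seconds, so a client request inside that range is accepted unchanged.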

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

massakam commented, Aug 11, 2018

Updates:

  • This phenomenon occurred only once in v2.0.1, but occurred many times in v1.22.1, so the causes may be different.
  • The v1.22.1 broker goes down when splitting and unloading a bundle. v2.0.1 and v1.21.0 do not go down.
  • The following is a thread dump taken right before the v1.22.1 broker went down: threaddump.txt
  • This phenomenon does not occur if v1.22.1 is modified as follows:
--- a/pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java
+++ b/pulsar-broker/src/main/java/org/apache/pulsar/broker/namespace/NamespaceService.java
@@ -22,6 +22,7 @@ import static com.google.common.base.Preconditions.checkArgument;
 import static com.google.common.base.Preconditions.checkNotNull;
 import static java.lang.String.format;
 import static java.util.concurrent.TimeUnit.SECONDS;
+import static org.apache.bookkeeper.mledger.util.SafeRun.safeRun;
 import static org.apache.pulsar.broker.cache.LocalZooKeeperCacheService.LOCAL_POLICIES_ROOT;
 import static org.apache.pulsar.broker.web.PulsarWebResource.joinPath;
 import static org.apache.pulsar.common.naming.NamespaceBundleFactory.getBundlesData;
@@ -596,7 +597,7 @@ public class NamespaceService {
                     checkNotNull(ownershipCache.tryAcquiringOwnership(sBundle));
                 }
                 updateNamespaceBundles(nsname, splittedBundles.getLeft(),
-                    (rc, path, zkCtx, stat) ->  {
+                    (rc, path, zkCtx, stat) -> pulsar.getOrderedExecutor().submit(safeRun(() -> {
                         if (rc == Code.OK.intValue()) {
                             // invalidate cache as zookeeper has new split
                             // namespace bundle
@@ -618,7 +619,7 @@ public class NamespaceService {
                             LOG.warn(msg);
                             updateFuture.completeExceptionally(new ServiceUnitNotReadyException(msg));
                         }
-                    });
+                    })));
             } catch (Exception e) {
                 String msg = format("failed to acquire ownership of split bundle for namespace [%s], %s",
                     nsname.toString(), e.getMessage());
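
The general pattern behind this patch, moving callback work off a single event-delivery thread and onto a separate executor, can be sketched in isolation. The example below is not Pulsar or ZooKeeper code: the class, the method, and the two executors are hypothetical stand-ins, and the idea that a blocked callback delays later notifications (for example, connection-state events) is an assumption about why the change helps, not something confirmed in this issue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CallbackOffloadSketch {

    // Returns true if an event queued behind a slow callback is still
    // delivered promptly by the (hypothetical) single event thread.
    static boolean eventThreadStaysResponsive(boolean offload) throws Exception {
        // Stand-in for a client library's single callback/event thread.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        // Stand-in for a separate worker pool (the patch uses an ordered executor).
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch slowWorkStarted = new CountDownLatch(1);
            CountDownLatch releaseSlowWork = new CountDownLatch(1);

            // Callback body that blocks until released, simulating slow work.
            Runnable slowCallbackBody = () -> {
                slowWorkStarted.countDown();
                try {
                    releaseSlowWork.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            };

            // The callback is always invoked on the event thread; the only
            // difference is whether its body runs inline or is handed off.
            eventThread.submit(() -> {
                if (offload) {
                    worker.submit(slowCallbackBody);
                } else {
                    slowCallbackBody.run();
                }
            });
            slowWorkStarted.await();

            // A later event, queued behind the callback on the same thread.
            Future<?> nextEvent = eventThread.submit(() -> { });
            boolean responsive;
            try {
                nextEvent.get(200, TimeUnit.MILLISECONDS);
                responsive = true;
            } catch (TimeoutException e) {
                responsive = false;
            }
            releaseSlowWork.countDown();
            return responsive;
        } finally {
            eventThread.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("inline callback, next event delivered in time: "
                + eventThreadStaysResponsive(false));
        System.out.println("offloaded callback, next event delivered in time: "
                + eventThreadStaysResponsive(true));
    }
}
```

When the body runs inline, the next event waits behind it; when the body is handed to the worker, the event thread returns immediately and keeps delivering events.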

I think the v1.22.1 broker goes down because the following two changes reintroduced this bug.

sijie commented, Aug 15, 2018

@massakam gotcha. Thank you for the clarification.
