Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Brokers fail when updating bookkeeper

See original GitHub issue

Describe the bug

Pulsar brokers start to fail with following error when bookkeepers are updating (ie in rolling manner, one at a time):

03:29:25.395 [BookKeeperClientScheduler-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Unexpected exception while handling joining bookie pulsar-bookkeeper-0.pulsar-bookkeeper.pulsar.svc.cluster.local:3181
java.lang.NullPointerException: null
	at org.apache.bookkeeper.net.NetUtils.resolveNetworkLocation(NetUtils.java:77) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.resolveNetworkLocation(TopologyAwareEnsemblePlacementPolicy.java:779) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.createBookieNode(TopologyAwareEnsemblePlacementPolicy.java:775) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.handleBookiesThatJoined(TopologyAwareEnsemblePlacementPolicy.java:707) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.handleBookiesThatJoined(RackawareEnsemblePlacementPolicyImpl.java:79) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.handleBookiesThatJoined(RackawareEnsemblePlacementPolicy.java:246) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.onClusterChanged(TopologyAwareEnsemblePlacementPolicy.java:654) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.onClusterChanged(RackawareEnsemblePlacementPolicyImpl.java:79) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.onClusterChanged(RackawareEnsemblePlacementPolicy.java:89) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.BookieWatcherImpl.processWritableBookiesChanged(BookieWatcherImpl.java:171) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.client.BookieWatcherImpl.lambda$initialBlockingBookieRead$1(BookieWatcherImpl.java:206) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.discover.ZKRegistrationClient$WatchTask.accept(ZKRegistrationClient.java:139) [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.discover.ZKRegistrationClient$WatchTask.accept(ZKRegistrationClient.java:62) [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) [?:1.8.0_232]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) [?:1.8.0_232]
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_232]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_232]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_232]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_232]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_232]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
03:29:28.764 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Timed-out 1 operations to channel [id: 0x8f82b194, L:/10.1.27.4:35010 - R:pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local/10.1.30.30:3181] for pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local:3181
03:29:28.764 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO  org.apache.bookkeeper.client.PendingReadOp - Error: Bookie operation timeout while reading L1707 E0 from bookie: pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local:3181
03:29:29.426 [pulsar-client-io-41-1] WARN  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck][reader-62bafd6e50] Failed to subscribe to topic on /10.1.27.4:6650
03:29:29.426 [pulsar-client-io-41-1] WARN  org.apache.pulsar.client.impl.PulsarClientImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] Failed to get create topic reader
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 696 lookup request timedout after ms 30000
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_232]
	at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1023) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
	at org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:156) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [io.netty-netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 696 lookup request timedout after ms 30000
	at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1022) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
	... 10 more
03:29:29.427 [pulsar-client-io-41-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 696 timed out after 30000 ms
03:29:29.427 [pulsar-client-io-41-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] [null] Failed to create producer: 697 lookup request timedout after ms 30000
03:29:29.430 [pulsar-client-io-41-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Mar/2020:03:28:34 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 500 2442 "-" "Pulsar-Java-v2.5.0" 55407
03:29:29.430 [pulsar-client-io-41-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 697 timed out after 30000 ms
03:29:29.431 [pulsar-client-io-41-1] WARN  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck][reader-d9165e25c4] Failed to subscribe to topic on /10.1.27.4:6650
03:29:29.431 [pulsar-client-io-41-1] WARN  org.apache.pulsar.client.impl.PulsarClientImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] Failed to get create topic reader
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 702 lookup request timedout after ms 30000
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_232]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_232]
	at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1023) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
	at org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:156) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [io.netty-netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 702 lookup request timedout after ms 30000
	at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1022) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
	... 10 more
03:29:29.433 [pulsar-client-io-41-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 702 timed out after 30000 ms
03:29:29.433 [pulsar-client-io-41-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] [null] Failed to create producer: 703 lookup request timedout after ms 30000
03:29:29.447 [pulsar-client-io-41-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Mar/2020:03:28:34 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 500 2442 "-" "Pulsar-Java-v2.5.0" 54952
03:29:29.448 [pulsar-client-io-41-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 703 timed out after 30000 ms

The healthcheck mentioned is the following command:

$ /pulsar/bin/pulsar-admin brokers healthcheck

NOTE: this bug was first reported to the relevant kubernetes chart maintainers with no resolutions as of this issue.

To Reproduce

Steps to reproduce the behavior:

Install the pulsar chart from https://github.com/kafkaesque-io/pulsar-helm-chart/tree/master/helm-chart-sources/pulsar
Make a change to the bookkeeper statefulset to trigger a rolling update: kubectl edit sts pulsar-bookkeeper and add an annotation to spec.template.metadata.annotations.
Watch the pods: watch kubectl get pod`

Expected behavior

Bookies should rolling update one by one with nothing else failing.

Actual behavior

Brokers start to fail and get restarted by the scheduler.

Additional context

Environment: Kubernetes cluster on GKE

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"archive", BuildDate:"2020-02-29T16:37:45Z", GoVersion:"go1.14", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.17", GitCommit:"bdceba0734835c6cb1acbd1c447caf17d8613b44", GitTreeState:"clean", BuildDate:"2020-01-17T23:10:13Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}

Issue Analytics

State:
Created 3 years ago
Comments:7 (3 by maintainers)

Top GitHub Comments

1reaction

sijiecommented, Mar 30, 2020

@ncmans the root cause is on third broker - ledger=35 - operation=Failed to read entry - entry=0. That means that your broker failed to read an entry when the bookkeeper cluster is being updated. I am guessing that the bookie stored ledger 35 was being updated. I don’t know how did you do a health check for your broker. I am guessing the health check in the helm chart wasn’t implemented in the correct way, which causes K8S failed brokers when the bookie server is temporarily unavailable.

You can reference our chart https://github.com/streamnative/charts to see how we implement the readiness and liveness probes: https://github.com/streamnative/charts/blob/master/charts/pulsar/templates/broker-statefulset.yaml#L135

0reactions

lhotaricommented, Aug 27, 2020

An additional change besides adjusting terminationGracePeriod in helm chart is required. I have created a PR https://github.com/apache/pulsar-helm-chart/pull/59 . That was inspired by @ncmans 's change in https://github.com/kafkaesque-io/pulsar-helm-chart/pull/31 .

Top Results From Across the Web

Frequently Asked Questions Concerning the July 30, 2013 ...

Question 12.1 (NEW, July 1, 2020): How should a broker-dealer indicate on the FOCUS Report that it is claiming an exemption from Rule...

Regulator Says Broker Audits Fail to Include Required Work

The Public Company Accounting Oversight Board said that it had found deficiencies in every audit of brokerage firms that it had reviewed.

ZooKeeper and BookKeeper administration - Apache Pulsar

BookKeeper provides persistent message storage for Pulsar. Each Pulsar broker has its own cluster of bookies. The BookKeeper cluster shares a local ZooKeeper ......

The itomdipulsar-bookkeeper pods are not accessible

kubectl -n <suite namespace> describe cm itomdipulsar-broker ... Wait for the changes to complete for one PVC and then update the next.

Understanding Accounting Errors, How to Detect and Prevent ...

An accounting error is an error in an accounting entry that was not ... Accounting errors are unintentional bookkeeping errors and are sometimes...