Brokers fail when updating bookkeeper
See original GitHub issueDescribe the bug
Pulsar brokers start to fail with following error when bookkeepers are updating (ie in rolling manner, one at a time):
03:29:25.395 [BookKeeperClientScheduler-OrderedScheduler-0-0] ERROR org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy - Unexpected exception while handling joining bookie pulsar-bookkeeper-0.pulsar-bookkeeper.pulsar.svc.cluster.local:3181
java.lang.NullPointerException: null
at org.apache.bookkeeper.net.NetUtils.resolveNetworkLocation(NetUtils.java:77) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.resolveNetworkLocation(TopologyAwareEnsemblePlacementPolicy.java:779) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.createBookieNode(TopologyAwareEnsemblePlacementPolicy.java:775) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.handleBookiesThatJoined(TopologyAwareEnsemblePlacementPolicy.java:707) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.handleBookiesThatJoined(RackawareEnsemblePlacementPolicyImpl.java:79) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.handleBookiesThatJoined(RackawareEnsemblePlacementPolicy.java:246) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.TopologyAwareEnsemblePlacementPolicy.onClusterChanged(TopologyAwareEnsemblePlacementPolicy.java:654) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicyImpl.onClusterChanged(RackawareEnsemblePlacementPolicyImpl.java:79) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.RackawareEnsemblePlacementPolicy.onClusterChanged(RackawareEnsemblePlacementPolicy.java:89) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.BookieWatcherImpl.processWritableBookiesChanged(BookieWatcherImpl.java:171) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.client.BookieWatcherImpl.lambda$initialBlockingBookieRead$1(BookieWatcherImpl.java:206) ~[org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.discover.ZKRegistrationClient$WatchTask.accept(ZKRegistrationClient.java:139) [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at org.apache.bookkeeper.discover.ZKRegistrationClient$WatchTask.accept(ZKRegistrationClient.java:62) [org.apache.bookkeeper-bookkeeper-server-4.10.0.jar:4.10.0]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) [?:1.8.0_232]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) [?:1.8.0_232]
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) [?:1.8.0_232]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_232]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_232]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_232]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
03:29:28.764 [BookKeeperClientScheduler-OrderedScheduler-0-0] INFO org.apache.bookkeeper.proto.PerChannelBookieClient - Timed-out 1 operations to channel [id: 0x8f82b194, L:/10.1.27.4:35010 - R:pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local/10.1.30.30:3181] for pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local:3181
03:29:28.764 [BookKeeperClientWorker-OrderedExecutor-0-0] INFO org.apache.bookkeeper.client.PendingReadOp - Error: Bookie operation timeout while reading L1707 E0 from bookie: pulsar-bookkeeper-3.pulsar-bookkeeper.pulsar.svc.cluster.local:3181
03:29:29.426 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck][reader-62bafd6e50] Failed to subscribe to topic on /10.1.27.4:6650
03:29:29.426 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.PulsarClientImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] Failed to get create topic reader
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 696 lookup request timedout after ms 30000
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_232]
at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1023) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
at org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:156) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [io.netty-netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 696 lookup request timedout after ms 30000
at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1022) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
... 10 more
03:29:29.427 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 696 timed out after 30000 ms
03:29:29.427 [pulsar-client-io-41-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] [null] Failed to create producer: 697 lookup request timedout after ms 30000
03:29:29.430 [pulsar-client-io-41-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Mar/2020:03:28:34 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 500 2442 "-" "Pulsar-Java-v2.5.0" 55407
03:29:29.430 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 697 timed out after 30000 ms
03:29:29.431 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck][reader-d9165e25c4] Failed to subscribe to topic on /10.1.27.4:6650
03:29:29.431 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.PulsarClientImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] Failed to get create topic reader
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 702 lookup request timedout after ms 30000
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_232]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_232]
at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1023) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
at org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:156) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [io.netty-netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.43.Final.jar:4.1.43.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 702 lookup request timedout after ms 30000
at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1022) ~[org.apache.pulsar-pulsar-client-original-2.5.0.jar:2.5.0]
... 10 more
03:29:29.433 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 702 timed out after 30000 ms
03:29:29.433 [pulsar-client-io-41-1] ERROR org.apache.pulsar.client.impl.ProducerImpl - [persistent://pulsar/pulsar/10.1.27.4:8080/healthcheck] [null] Failed to create producer: 703 lookup request timedout after ms 30000
03:29:29.447 [pulsar-client-io-41-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [24/Mar/2020:03:28:34 +0000] "GET /admin/v2/brokers/health HTTP/1.1" 500 2442 "-" "Pulsar-Java-v2.5.0" 54952
03:29:29.448 [pulsar-client-io-41-1] WARN org.apache.pulsar.client.impl.ClientCnx - [id: 0x449aeac5, L:/10.1.27.4:49298 - R:/10.1.27.4:6650] request 703 timed out after 30000 ms
The healthcheck mentioned is the following command:
$ /pulsar/bin/pulsar-admin brokers healthcheck
NOTE: this bug was first reported to the relevant kubernetes chart maintainers with no resolutions as of this issue.
To Reproduce
Steps to reproduce the behavior:
- Install the pulsar chart from https://github.com/kafkaesque-io/pulsar-helm-chart/tree/master/helm-chart-sources/pulsar
- Make a change to the bookkeeper statefulset to trigger a rolling update:
kubectl edit sts pulsar-bookkeeper
and add an annotation tospec.template.metadata.annotations
. - Watch the pods:
watch
kubectl get pod`
Expected behavior
Bookies should rolling update one by one with nothing else failing.
Actual behavior
Brokers start to fail and get restarted by the scheduler.
Additional context
Environment: Kubernetes cluster on GKE
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"archive", BuildDate:"2020-02-29T16:37:45Z", GoVersion:"go1.14", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.17", GitCommit:"bdceba0734835c6cb1acbd1c447caf17d8613b44", GitTreeState:"clean", BuildDate:"2020-01-17T23:10:13Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
Issue Analytics
- State:
- Created 3 years ago
- Comments:7 (3 by maintainers)
Top Results From Across the Web
Frequently Asked Questions Concerning the July 30, 2013 ...
Question 12.1 (NEW, July 1, 2020): How should a broker-dealer indicate on the FOCUS Report that it is claiming an exemption from Rule...
Read more >Regulator Says Broker Audits Fail to Include Required Work
The Public Company Accounting Oversight Board said that it had found deficiencies in every audit of brokerage firms that it had reviewed.
Read more >ZooKeeper and BookKeeper administration - Apache Pulsar
BookKeeper provides persistent message storage for Pulsar. Each Pulsar broker has its own cluster of bookies. The BookKeeper cluster shares a local ZooKeeper ......
Read more >The itomdipulsar-bookkeeper pods are not accessible
kubectl -n <suite namespace> describe cm itomdipulsar-broker ... Wait for the changes to complete for one PVC and then update the next.
Read more >Understanding Accounting Errors, How to Detect and Prevent ...
An accounting error is an error in an accounting entry that was not ... Accounting errors are unintentional bookkeeping errors and are sometimes...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@ncmans the root cause is on third broker -
ledger=35 - operation=Failed to read entry - entry=0
. That means that your broker failed to read an entry when the bookkeeper cluster is being updated. I am guessing that the bookie storedledger 35
was being updated. I don’t know how did you do a health check for your broker. I am guessing the health check in the helm chart wasn’t implemented in the correct way, which causes K8S failed brokers when the bookie server is temporarily unavailable.You can reference our chart https://github.com/streamnative/charts to see how we implement the readiness and liveness probes: https://github.com/streamnative/charts/blob/master/charts/pulsar/templates/broker-statefulset.yaml#L135
An additional change besides adjusting terminationGracePeriod in helm chart is required. I have created a PR https://github.com/apache/pulsar-helm-chart/pull/59 . That was inspired by @ncmans 's change in https://github.com/kafkaesque-io/pulsar-helm-chart/pull/31 .