Topic Operator failing to start with io.vertx.core.VertxException: Thread blocked
Describe the bug
When deploying a very simple cluster with the topicOperator enabled, the topic operator container fails to start. The container logs report a blocked thread, and the Kubernetes liveness probe eventually kills the container.
2021-12-16 00:16:50,79115 WARN [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 2542 ms, time limit is 2000 ms
2021-12-16 00:16:51,79090 WARN [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 3542 ms, time limit is 2000 ms
2021-12-16 00:16:52,79034 WARN [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 4541 ms, time limit is 2000 ms
2021-12-16 00:16:53,79105 WARN [vertx-blocked-thread-checker] BlockedThreadChecker: - Thread Thread[vert.x-eventloop-thread-0,5,main] has been blocked for 5542 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
at jdk.internal.misc.Unsafe.park(Native Method) ~[?:?]
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194) ~[?:?]
at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1796) ~[?:?]
at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3128) ~[?:?]
at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1823) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1998) ~[?:?]
at io.apicurio.registry.utils.ConcurrentUtil.get(ConcurrentUtil.java:35) ~[io.apicurio.apicurio-registry-common-1.3.2.Final.jar:?]
at io.apicurio.registry.utils.ConcurrentUtil.get(ConcurrentUtil.java:27) ~[io.apicurio.apicurio-registry-common-1.3.2.Final.jar:?]
at io.apicurio.registry.utils.ConcurrentUtil.result(ConcurrentUtil.java:54) ~[io.apicurio.apicurio-registry-common-1.3.2.Final.jar:?]
at io.strimzi.operator.topic.Session.lambda$start$9(Session.java:198) ~[io.strimzi.topic-operator-0.26.0.jar:0.26.0]
at io.strimzi.operator.topic.Session$$Lambda$278/0x0000000840319840.handle(Unknown Source) ~[?:?]
at io.vertx.core.impl.future.FutureImpl$3.onSuccess(FutureImpl.java:141) ~[io.vertx.vertx-core-4.1.5.jar:4.1.5]
at io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:54) ~[io.vertx.vertx-core-4.1.5.jar:4.1.5]
at io.vertx.core.impl.future.FutureBase$$Lambda$293/0x000000084031e040.run(Unknown Source) ~[?:?]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[io.netty.netty-transport-4.1.68.Final.jar:4.1.68.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.68.Final.jar:4.1.68.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
To Reproduce
Steps to reproduce the behavior:
- Install Strimzi Operator using the 0.26.0 helm chart
- Create a Cluster manifest:
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka-basic
spec:
  kafka:
    version: 3.0.0
    replicas: 1
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: ephemeral
  zookeeper:
    replicas: 1
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
- Apply the manifest with
kubectl apply -f kafka-basic.yaml
- Watch the topic operator logs with
kubectl logs deploy/kafka-basic-entity-operator -c topic-operator
Expected behavior
The topic operator starts correctly.
Environment:
- Strimzi version: 0.26.0
- Installation method: Helm chart
- Kubernetes cluster: Kubernetes 1.20.7
- Infrastructure: Amazon EKS
YAML files and logs
Thanks for the handy script! report-16-12-2021_11-26-59.zip
Additional context
Similar errors show up in these issues:
- https://github.com/strimzi/strimzi-kafka-operator/issues/383
- https://github.com/strimzi/strimzi-kafka-operator/issues/1050
- https://github.com/strimzi/strimzi-kafka-operator/issues/4964
Increasing the resource requests and limits for the topic operator didn't change the behaviour.
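For reference, the bump was applied roughly along these lines in the Kafka CR; the field path spec.entityOperator.topicOperator.resources is standard Strimzi, but the request/limit values shown are illustrative rather than the exact ones used:

```yaml
# Sketch: raising topic operator resources via the Kafka CR.
# The request/limit values below are illustrative only.
spec:
  entityOperator:
    topicOperator:
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi
```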
Zookeeper doesn’t show any errors or timeouts.
Also running into this.
For the time being, I am defaulting back to the ZooKeeper topic store instead of the Kafka Streams topic store by doing the following:
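A sketch of that override, assuming the STRIMZI_USE_ZOOKEEPER_TOPIC_STORE environment variable is still honoured by the 0.26.x topic operator and that it can be injected through the entity operator's container template in the Kafka CR:

```yaml
# Sketch: force the topic operator back to the ZooKeeper-based topic store.
# Assumes STRIMZI_USE_ZOOKEEPER_TOPIC_STORE is still supported in this version.
spec:
  entityOperator:
    topicOperator: {}
    userOperator: {}
    template:
      topicOperatorContainer:
        env:
          - name: STRIMZI_USE_ZOOKEEPER_TOPIC_STORE
            value: "true"
```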
Using ZK for now is fine, but as you note ZK will eventually disappear. So I guess overriding is fine in the short term.