Cluster operator generated new certificates for no reason
We used Strimzi 0.17.2 on OpenShift v3.11 with Trident for the storage.
The cluster resources carry the labels strimzi.io/kind=Kafka and strimzi.io/cluster=cluster-kafka-persistent. The Kafka resource is:
apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: cluster-kafka-persistent
spec:
  kafka:
    authorization:
      type: simple
    version: 2.4.0
    replicas: 5
    listeners:
      external:
        authentication:
          type: scram-sha-512
        type: route
    config:
      offsets.topic.replication.factor: 5
      transaction.state.log.replication.factor: 5
      transaction.state.log.min.isr: 3
      log.message.format.version: "2.4"
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
      class: backend-silver
  zookeeper:
    replicas: 5
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
      class: backend-silver
  clusterCa:
    generateCertificateAuthority: true
    validityDays: 1460
  clientsCa:
    generateCertificateAuthority: true
    validityDays: 1460
  entityOperator:
    topicOperator: {}
    userOperator: {}
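For reference, one way to check the validity window of the operator-generated cluster CA is to decode the certificate from its secret (the secret name follows the default Strimzi naming, and the ca.crt key is assumed from the standard layout):

oc get secret cluster-kafka-persistent-cluster-ca-cert -n kafka-uat \
  -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates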
We set the certificate renewal (validityDays) to 1460 days. Three days ago, for no apparent reason, the cluster-operator re-generated these secrets:
- cluster-kafka-persistent-clients-ca
- cluster-kafka-persistent-clients-ca-cert
- cluster-kafka-persistent-cluster-ca
- cluster-kafka-persistent-cluster-ca-cert
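One way to confirm when these secrets were recreated (using the cluster label mentioned above) is to list their creation timestamps:

oc get secrets -n kafka-uat -l strimzi.io/cluster=cluster-kafka-persistent \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp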
We ended up in this state:
[manage_strimzi]# oc get all
NAME READY STATUS RESTARTS AGE
pod/cluster-kafka-persistent-zookeeper-0 1/2 CrashLoopBackOff 812 2d
pod/cluster-kafka-persistent-zookeeper-1 1/2 CrashLoopBackOff 802 2d
pod/cluster-kafka-persistent-zookeeper-2 1/2 CrashLoopBackOff 812 2d
pod/cluster-kafka-persistent-zookeeper-3 1/2 CrashLoopBackOff 813 2d
pod/cluster-kafka-persistent-zookeeper-4 1/2 CrashLoopBackOff 803 2d
pod/strimzi-cluster-operator-d5b6c6458-fpbqx 1/1 Running 146 19d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/cluster-kafka-persistent-zookeeper-client ClusterIP 172.30.253.229 <none> 2181/TCP 2d
service/cluster-kafka-persistent-zookeeper-nodes ClusterIP None <none> 2181/TCP,2888/TCP,3888/TCP 2d
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deployment.apps/strimzi-cluster-operator 1 1 1 1 225d
NAME DESIRED CURRENT READY AGE
replicaset.apps/strimzi-cluster-operator-d5b6c6458 1 1 1 167d
NAME DESIRED CURRENT AGE
statefulset.apps/cluster-kafka-persistent-zookeeper 5 5 2d
[manage_strimzi]#
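Before restarting anything, a first diagnostic for the crash-looping ZooKeeper pods could be to inspect the pod events and the previous container logs (the tls-sidecar container name is an assumption based on the 1/2 readiness of a Strimzi 0.17 ZooKeeper pod):

oc describe pod/cluster-kafka-persistent-zookeeper-0 -n kafka-uat
oc logs pod/cluster-kafka-persistent-zookeeper-0 -c tls-sidecar -n kafka-uat --previous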
We then tried to restart the pods by scaling down the operator, deleting the StatefulSets, and scaling the operator back up:
oc scale deployment.apps/strimzi-cluster-operator --replicas=0 -n kafka-uat
oc delete statefulset.apps/cluster-kafka-persistent-kafka statefulset.apps/cluster-kafka-persistent-zookeeper -n kafka-uat
oc scale deployment.apps/strimzi-cluster-operator --replicas=1 -n kafka-uat
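In hindsight, a precaution before deleting the StatefulSets would have been to back up the CA secrets so they could be restored if anything went wrong, for example:

oc get secret cluster-kafka-persistent-cluster-ca \
  cluster-kafka-persistent-cluster-ca-cert \
  cluster-kafka-persistent-clients-ca \
  cluster-kafka-persistent-clients-ca-cert \
  -n kafka-uat -o yaml > ca-secrets-backup.yaml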
We did not manage to start the pods. The cluster-operator pod logged this error:
2020-09-24 09:40:40 ERROR AbstractOperator:124 - Reconciliation #1(timer) Kafka(kafka-uat/cluster-kafka-persistent): createOrUpdate failed
java.lang.NullPointerException: null
at io.strimzi.operator.cluster.model.ModelUtils.buildSecret(ModelUtils.java:248) ~[io.strimzi.cluster-operator-0.17.0.jar:0.17.0]
at io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperator$ReconciliationState.clusterOperatorSecret(KafkaAssemblyOperator.java:3187) ~[io.strimzi.cluster-operator-0.17.0.jar:0.17.0]
at io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperator.lambda$reconcile$3(KafkaAssemblyOperator.java:254) ~[io.strimzi.cluster-operator-0.17.0.jar:0.17.0]
at io.vertx.core.Future.lambda$compose$3(Future.java:360) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:107) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:152) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:113) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:178) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:21) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:107) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:152) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:113) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:178) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:21) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:330) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:369) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
I had no other solution than to delete everything and reinstall. Do you have any idea why this happened? I also noticed that the cluster-operator pod had restarted many times; could there be a memory leak in this pod?
PS: I will increase the JVM memory for the cluster-operator pod.
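For reference, a sketch of one way to raise the operator container's memory limits (the 512Mi values are only illustrative):

oc set resources deployment/strimzi-cluster-operator -n kafka-uat \
  --requests=memory=512Mi --limits=memory=512Mi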
Best regards, Toty
Top GitHub Comments
Looks like for whatever reason the secret with the certificate got deleted or damaged. Do you have a full log from the cluster operator for when it happened? (Ideally at DEBUG level, but even without that it might be helpful.) The NullPointerException is probably just a follow-up rather than the actual cause.
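For reference, DEBUG logging on the cluster operator is usually enabled through an environment variable on its deployment (the STRIMZI_LOG_LEVEL name is assumed from the standard Strimzi install), e.g.:

oc set env deployment/strimzi-cluster-operator STRIMZI_LOG_LEVEL=DEBUG -n kafka-uat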
I know it’s difficult when we don’t have a log file. I hope it does not happen again. Thanks for your support.