
Cluster operator generated new certificates for no reason

See original GitHub issue

We use Strimzi 0.17.2 on OpenShift v3.11 with Trident for storage.

The cluster resources are labeled strimzi.io/kind=Kafka and strimzi.io/cluster=cluster-kafka-persistent, and the Kafka custom resource is:

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: cluster-kafka-persistent
spec:
  kafka:
    authorization:
      type: simple
    version: 2.4.0
    replicas: 5
    listeners:
      external:
        authentication:
          type: scram-sha-512
        type: route
    config:
      offsets.topic.replication.factor: 5
      transaction.state.log.replication.factor: 5
      transaction.state.log.min.isr: 3
      log.message.format.version: "2.4"
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
      class: backend-silver

  zookeeper:
    replicas: 5
    storage:
      type: persistent-claim
      size: 100Gi
      deleteClaim: false
      class: backend-silver

  clusterCa:
    generateCertificateAuthority: true
    validityDays: 1460

  clientsCa:
    generateCertificateAuthority: true
    validityDays: 1460

  entityOperator:
    topicOperator: {}
    userOperator: {}

We set the certificate validity to 1460 days. Three days ago, for no apparent reason, the cluster operator regenerated these secrets:

  • cluster-kafka-persistent-clients-ca
  • cluster-kafka-persistent-clients-ca-cert
  • cluster-kafka-persistent-cluster-ca
  • cluster-kafka-persistent-cluster-ca-cert
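
For reference, the validity window of the regenerated CA certificates can be checked straight from the secrets. This is a minimal sketch, assuming Strimzi's default secret layout where the CA certificate is stored under the ca.crt key:

# Inspect the validity dates of the regenerated cluster CA certificate
# (secret and key names assume Strimzi's default naming for this cluster).
oc get secret cluster-kafka-persistent-cluster-ca-cert -n kafka-uat \
  -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates

# Same check for the clients CA certificate.
oc get secret cluster-kafka-persistent-clients-ca-cert -n kafka-uat \
  -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -dates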

We ended up in this state:

[manage_strimzi]# oc get all
NAME                                           READY     STATUS             RESTARTS   AGE
pod/cluster-kafka-persistent-zookeeper-0       1/2       CrashLoopBackOff   812        2d
pod/cluster-kafka-persistent-zookeeper-1       1/2       CrashLoopBackOff   802        2d
pod/cluster-kafka-persistent-zookeeper-2       1/2       CrashLoopBackOff   812        2d
pod/cluster-kafka-persistent-zookeeper-3       1/2       CrashLoopBackOff   813        2d
pod/cluster-kafka-persistent-zookeeper-4       1/2       CrashLoopBackOff   803        2d
pod/strimzi-cluster-operator-d5b6c6458-fpbqx   1/1       Running            146        19d

NAME                                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/cluster-kafka-persistent-zookeeper-client   ClusterIP   172.30.253.229   <none>        2181/TCP                     2d
service/cluster-kafka-persistent-zookeeper-nodes    ClusterIP   None             <none>        2181/TCP,2888/TCP,3888/TCP   2d

NAME                                       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/strimzi-cluster-operator   1         1         1            1           225d

NAME                                                 DESIRED   CURRENT   READY     AGE
replicaset.apps/strimzi-cluster-operator-d5b6c6458   1         1         1         167d

NAME                                                  DESIRED   CURRENT   AGE
statefulset.apps/cluster-kafka-persistent-zookeeper   5         5         2d
[manage_strimzi]#_
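
To see which container in the ZooKeeper pods is crash looping, something like the following can help. The tls-sidecar container name is an assumption based on the Strimzi 0.17 pod layout, where each ZooKeeper pod runs a zookeeper container plus a TLS sidecar:

# List the containers in one ZooKeeper pod, then pull the logs of the failing one.
oc get pod cluster-kafka-persistent-zookeeper-0 -n kafka-uat \
  -o jsonpath='{.spec.containers[*].name}'
# "tls-sidecar" is assumed; replace it with whatever the command above reports.
oc logs cluster-kafka-persistent-zookeeper-0 -c tls-sidecar -n kafka-uat --previous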

We then tried to relaunch the pods by stopping and restarting the operator with these commands:

oc scale deployment.apps/strimzi-cluster-operator --replicas=0 -n kafka-uat
oc delete statefulset.apps/cluster-kafka-persistent-kafka statefulset.apps/cluster-kafka-persistent-zookeeper -n kafka-uat
oc scale deployment.apps/strimzi-cluster-operator --replicas=1 -n kafka-uat
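
After scaling the operator back up, the next reconciliation should recreate the StatefulSets; progress can be watched with, for example:

# Watch the operator recreate the StatefulSets and bring the pods back up.
oc get statefulsets,pods -n kafka-uat -w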

We did not manage to start the pods. The cluster operator pod logged this error:

2020-09-24 09:40:40 ERROR AbstractOperator:124 - Reconciliation #1(timer) Kafka(kafka-uat/cluster-kafka-persistent): createOrUpdate failed
java.lang.NullPointerException: null
at io.strimzi.operator.cluster.model.ModelUtils.buildSecret(ModelUtils.java:248) ~[io.strimzi.cluster-operator-0.17.0.jar:0.17.0]
at io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperator$ReconciliationState.clusterOperatorSecret(KafkaAssemblyOperator.java:3187) ~[io.strimzi.cluster-operator-0.17.0.jar:0.17.0]
at io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperator.lambda$reconcile$3(KafkaAssemblyOperator.java:254) ~[io.strimzi.cluster-operator-0.17.0.jar:0.17.0]
at io.vertx.core.Future.lambda$compose$3(Future.java:360) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:107) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:152) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:113) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:178) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:21) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:107) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:152) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:113) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:178) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:21) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:330) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]
at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:369) ~[io.vertx.vertx-core-3.8.5.jar:3.8.5]

I had no option but to delete everything and reinstall. Do you have any idea why this happened? I also noticed that the cluster operator restarted many times (146 restarts in the output above); could there be a memory leak in this pod?

PS: I will increase the JVM memory for the cluster operator pod.
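
A minimal sketch of doing that, assuming the standard strimzi-cluster-operator Deployment in the kafka-uat namespace (the memory values below are placeholders, not recommendations):

# Raise the memory request/limit on the cluster operator container.
oc set resources deployment/strimzi-cluster-operator -n kafka-uat \
  --requests=memory=384Mi --limits=memory=768Mi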

Best regards, Toty

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
scholzj commented, Sep 28, 2020

It looks like the secret with the certificate got deleted or damaged for some reason. Do you have a full log from the cluster operator from when it happened? (Ideally at DEBUG level, but even without that it might be helpful.) The NullPointerException is probably just a follow-up failure rather than the actual cause.
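
For anyone who hits the same thing, the operator log can be captured along these lines (a sketch; the STRIMZI_LOG_LEVEL environment variable is assumed to be the logging knob for this operator version):

# Save the current and, if present, the previous cluster operator log.
oc logs deployment/strimzi-cluster-operator -n kafka-uat > cluster-operator.log
oc logs deployment/strimzi-cluster-operator -n kafka-uat --previous > cluster-operator-previous.log || true
# Assumed: raising the log level via the STRIMZI_LOG_LEVEL env var (this rolls the operator pod).
oc set env deployment/strimzi-cluster-operator -n kafka-uat STRIMZI_LOG_LEVEL=DEBUG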

0 reactions
totee19 commented, Sep 30, 2020

I know it’s difficult when we don’t have a log file. I hope it doesn’t happen again. Thanks for your support.

Read more comments on GitHub >

Top Results From Across the Web

Troubleshooting Operator issues - OpenShift Documentation
Default OpenShift Container Platform cluster Operators are managed by the Cluster Version Operator (CVO) and they do not have a Subscription ...
Read more >
Cluster network operator pod 's internal webhook exposes an ...
Cluster network operator pod 's internal webhook exposes an API which certificate could eventually expire. - Red Hat Customer Portal.
Read more >
Issues with Certificate manager (cert-manager) while upgrading
Resolving the problem. Ensure you delete the resources created by the previous Certificate manager (cert-manager) to allow the operator to create new resources....
Read more >
Update security certificates with a different CA | Elasticsearch ...
On any node in your cluster, generate a new CA certificate. You only need to complete this step one time. If you're using...
Read more >
Certificate management | CockroachDB Docs
How to authenticate a secure 3-node CockroachDB cluster with Kubernetes. ... By default, the Operator will generate and sign 1 client and 1...
Read more >
