
Periodic failure to reconcile a difference in Service .spec.internalTrafficPolicy


Describe the bug

The issue looks similar to https://github.com/strimzi/strimzi-kafka-operator/issues/2691.

  1. As of Kubernetes 1.23, Service resources have .spec.internalTrafficPolicy set to Cluster by default (reference). When reconciling KafkaConnect resources, Strimzi continuously attempts to unset this default value and fails.


  2. Additionally, the details being logged are slightly misleading. The following example is formatted for readability; see the full log fragment below.

    Service acme-connect-api differs: {"op":"remove","path":"/spec/internalTrafficPolicy"}
    Current Service acme-connect-api path /spec/internalTrafficPolicy has value "Cluster"
    Desired Service acme-connect-api path /spec/internalTrafficPolicy has value 
    

    The last message reads as though the desired value is something non-printable (e.g. an empty string), which suggests that the operator will attempt to apply a patch like {"op":"replace","path":"/spec/internalTrafficPolicy","value":""}. That could plausibly be the root cause of the failure, but according to the first line the operation is actually a remove, not a replace (see the sketch right after this list).
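
For illustration, here is a minimal, hypothetical sketch of how a generic JSON-Patch diff between the desired Service (which does not set the field) and the current Service returned by the API server (where the field has been defaulted to Cluster) produces exactly this remove operation. The Jackson/zjsonpatch usage below is an assumption made for the sketch, not Strimzi's actual diffing code.

// Hypothetical sketch, not Strimzi's code: a generic JSON-Patch diff between the
// current and desired Service specs yields a "remove" op for a field that only
// the API server set. Jackson and com.flipkart.zjsonpatch are assumed dependencies.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.flipkart.zjsonpatch.JsonDiff;

public class InternalTrafficPolicyDiff {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Current state as returned by the API server: the field was defaulted to "Cluster".
        JsonNode current = mapper.readTree(
            "{\"spec\":{\"type\":\"ClusterIP\",\"internalTrafficPolicy\":\"Cluster\"}}");

        // Desired state built by the operator: the field is simply not set.
        JsonNode desired = mapper.readTree(
            "{\"spec\":{\"type\":\"ClusterIP\"}}");

        // Diffing current -> desired produces a "remove" op rather than a "replace"
        // with an empty value, which matches the first log line above.
        JsonNode patch = JsonDiff.asJson(current, desired);
        System.out.println(patch);
        // [{"op":"remove","path":"/spec/internalTrafficPolicy"}]
    }
}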

How To Reproduce

I didn’t manage to reproduce the issue in minikube (Kubernetes 1.23.3). When I deployed Strimzi as described in the [documentation](https://strimzi.io/quickstarts/) and created the KafkaConnect resource below, it didn’t lead to Strimzi creating an -api service. In the environment where the issue is reproducible, we use the Topic and Cluster operators.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  annotations:
    strimzi.io/use-connector-resources: "true"
  labels:
    app: kc
  name: kc
spec:
  bootstrapServers: my-cluster-kafka-bootstrap.kafka.svc:9093
  config:
    config.storage.replication.factor: 1
    offset.storage.replication.factor: 1
    status.storage.replication.factor: 1
  image: strimzi/kafka:0.30.0-kafka-3.2.0
  replicas: 1
  template:
    deployment:
      deploymentStrategy: Recreate
  tls:
    trustedCertificates: []
  version: 3.2.0

I can try to reproduce the issue in isolation or provide more details upon request.

Expected behavior

Strimzi ignores the default value and does not attempt to patch the service.
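
One way this could look (a hypothetical sketch, not Strimzi's implementation) is to filter the computed JSON-Patch operations against a list of paths that the API server populates on its own, so that a server-defaulted field never triggers a PATCH request. The path list below is an assumption, modelled on the "Ignoring ... diff" log lines of the development operator quoted further down.

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

import com.fasterxml.jackson.databind.JsonNode;

public class DefaultedPathFilter {

    // Paths filled in by the API server rather than by the operator; this list is an
    // assumption for the sketch, mirroring the paths the dev operator logs as ignored.
    private static final Pattern SERVER_MANAGED_PATHS = Pattern.compile(
        "^(/metadata/managedFields"
        + "|/metadata/creationTimestamp"
        + "|/metadata/resourceVersion"
        + "|/metadata/uid"
        + "|/spec/clusterIPs?"
        + "|/spec/internalTrafficPolicy"
        + "|/spec/sessionAffinity"
        + "|/status)$");

    // Keeps only the patch operations that target fields the operator actually manages.
    static List<JsonNode> relevantOps(JsonNode patch) {
        return StreamSupport.stream(patch.spliterator(), false)
                .filter(op -> !SERVER_MANAGED_PATHS.matcher(op.get("path").asText()).matches())
                .collect(Collectors.toList());
    }
}

With a filter like this, the /spec/internalTrafficPolicy removal would never reach the PATCH call and the Service would be reported as unchanged, which is the behaviour of the dev operator shown in the comments below.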

Environment (please complete the following information):

  • Strimzi version: 0.30.0
  • Installation method: YAML files
  • Kubernetes cluster: Kubernetes 1.23
  • Infrastructure: AWS EC2, kOps

YAML files and logs

2022-10-05 20:56:07 DEBUG AbstractResourceOperator:115 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Service acme/acme-connect-api already exists, patching it
2022-10-05 20:56:07 DEBUG ResourceDiff:33 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/metadata/managedFields"}
2022-10-05 20:56:07 DEBUG ResourceDiff:33 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/spec/clusterIP"}
2022-10-05 20:56:07 DEBUG ResourceDiff:33 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/spec/clusterIPs"}
2022-10-05 20:56:07 DEBUG ResourceDiff:38 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Service acme-connect-api differs: {"op":"remove","path":"/spec/internalTrafficPolicy"}
2022-10-05 20:56:07 DEBUG ResourceDiff:39 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Current Service acme-connect-api path /spec/internalTrafficPolicy has value "Cluster"
2022-10-05 20:56:07 DEBUG ResourceDiff:40 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Desired Service acme-connect-api path /spec/internalTrafficPolicy has value 
2022-10-05 20:56:07 DEBUG AbstractResourceOperator:249 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Caught exception while patching Service acme-connect-api in namespace acme
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://100.64.0.1/api/v1/namespaces/acme/services/acme-connect-api. Message: the server rejected our request due to an error in our request. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=the server rejected our request due to an error in our request, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handlePatch(OperationSupport.java:411) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handlePatch(OperationSupport.java:372) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handlePatch(BaseOperation.java:654) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$patch$3(HasMetadataOperation.java:243) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.patch(HasMetadataOperation.java:248) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.patch(HasMetadataOperation.java:258) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.patch(HasMetadataOperation.java:43) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.fabric8.kubernetes.client.dsl.Patchable.patch(Patchable.java:35) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
	at io.strimzi.operator.common.operator.resource.AbstractResourceOperator.internalPatch(AbstractResourceOperator.java:245) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
	at io.strimzi.operator.common.operator.resource.AbstractResourceOperator.internalPatch(AbstractResourceOperator.java:239) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
	at io.strimzi.operator.common.operator.resource.ServiceOperator.internalPatch(ServiceOperator.java:95) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
	at io.strimzi.operator.common.operator.resource.ServiceOperator.internalPatch(ServiceOperator.java:29) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
	at io.strimzi.operator.common.operator.resource.AbstractResourceOperator.lambda$reconcile$0(AbstractResourceOperator.java:116) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159) ~[io.vertx.vertx-core-4.2.4.jar:4.2.4]
	at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100) ~[io.vertx.vertx-core-4.2.4.jar:4.2.4]
	at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157) ~[io.vertx.vertx-core-4.2.4.jar:4.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.77.Final.jar:4.1.77.Final]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]

Additional context

Applying the same patch to the same resource via kubectl results in a 200 response (not a 422):

kubectl -v6 \
    patch svc acme-connect-api \
    --type=json \
    -p='[{"op":"remove","path":"/spec/internalTrafficPolicy"}]'
...
I1006 22:12:29.502240   73865 round_trippers.go:553] PATCH https://api.example.com/api/v1/namespaces/acme/services/acme-connect-api?fieldManager=kubectl-patch 200 OK in 223 milliseconds
service/acme-connect-api patched

But the patch still shouldn’t be attempted in the first place, because it doesn’t lead to any change; the API server re-applies the default, so the field reads back as Cluster:

kubectl \
    get svc acme-connect-api \
    -o jsonpath="{.spec.internalTrafficPolicy}"

Cluster

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

1 reaction
scholzj commented, Oct 19, 2022

Thanks for the feedback on this. We are planning to do 0.32 next week, so you should be able to move to that once it is out.

0 reactions
morozov commented, Oct 19, 2022

@scholzj we got the dev Cluster Operator deployed to a dev environment and I can confirm that the unexpected reconciliations don’t happen there. Here’s an excerpt from the logs:

2022-10-19 17:53:26 DEBUG AbstractResourceOperator:121 - Reconciliation #584(timer) KafkaConnect(acme/acme): Service acme/acme-connect-api already exists, patching it
2022-10-19 17:53:26 DEBUG ResourceDiff:31 - Reconciliation #584(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/metadata/creationTimestamp"}
2022-10-19 17:53:26 DEBUG ResourceDiff:31 - Reconciliation #584(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/metadata/managedFields"}
2022-10-19 17:53:26 DEBUG ResourceDiff:31 - Reconciliation #584(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/metadata/resourceVersion"}
2022-10-19 17:53:26 DEBUG ResourceDiff:31 - Reconciliation #584(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/metadata/uid"}
2022-10-19 17:53:26 DEBUG ResourceDiff:31 - Reconciliation #584(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/spec/clusterIP"}
2022-10-19 17:53:26 DEBUG ResourceDiff:31 - Reconciliation #584(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/spec/clusterIPs"}
2022-10-19 17:53:26 DEBUG ResourceDiff:31 - Reconciliation #584(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/spec/internalTrafficPolicy"}
2022-10-19 17:53:26 DEBUG ResourceDiff:31 - Reconciliation #584(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/spec/sessionAffinity"}
2022-10-19 17:53:26 DEBUG ResourceDiff:31 - Reconciliation #584(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/status"}
2022-10-19 17:53:26 DEBUG AbstractResourceOperator:255 - Reconciliation #584(timer) KafkaConnect(acme/acme): Service acme-connect-api in namespace acme did not changed and doesn't need patching

Unfortunately, the deployment took quite a while, and I lost the logs of the previous operator version in the same environment to compare them side-by-side. I cannot quickly roll back to the previous version either because it requires deploying the old cluster-level resources, and this process is insufficiently automated on our end.

To the best of my knowledge, we did have “Failure executing: PATCH” errors there that were unrelated to /spec/internalTrafficPolicy, and I no longer see them.

Would you consider tagging a stable release so that we could roll it out to all environments? In case it doesn’t help and we still see the failures, I’ll do more troubleshooting on our side and file a new issue with more details.
