Periodic failure to reconcile a difference in Service .spec.internalTrafficPolicy
Describe the bug
The issue looks similar to https://github.com/strimzi/strimzi-kafka-operator/issues/2691.
- As of Kubernetes 1.23, Service resources have .spec.internalTrafficPolicy set to Cluster by default (reference). When reconciling KafkaConnect resources, Strimzi continuously attempts to unset this default value and fails.
- Additionally, the details being logged are slightly misleading. The following example is formatted for readability; see the full fragment below.
Service acme-connect-api differs: {"op":"remove","path":"/spec/internalTrafficPolicy"}
Current Service acme-connect-api path /spec/internalTrafficPolicy has value "Cluster"
Desired Service acme-connect-api path /spec/internalTrafficPolicy has value
The last message reads as if the desired value were something non-printable (e.g. an empty string), which suggests that the operator attempts to apply a patch like
{"op":"replace","path":"/spec/internalTrafficPolicy","value":""}
That could be the root cause of the issue, but according to the first line the operation is actually a remove, not a replace.
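One way to test that hypothesis without touching the service (the empty value here is an assumption read from the log line, not something observed on the wire) is to send the suspected replace patch as a server-side dry run; it should be rejected with a 422, because "" is not one of the valid values Cluster and Local:
kubectl -v6 \
  patch svc acme-connect-api \
  --dry-run=server \
  --type=json \
  -p='[{"op":"replace","path":"/spec/internalTrafficPolicy","value":""}]'
The remove variant of the same patch, shown under Additional context below, is accepted with a 200.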
How To Reproduce
I didn’t manage to reproduce the issue in minikube (Kubernetes 1.23.3). When I deployed Strimzi as described in the [documentation](https://strimzi.io/quickstarts/) and created the KafkaConnect resource below, Strimzi did not create an -api service. In the environment where the issue is reproducible, we use the Topic and Cluster operators.
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  annotations:
    strimzi.io/use-connector-resources: "true"
  labels:
    app: kc
  name: kc
spec:
  bootstrapServers: my-cluster-kafka-bootstrap.kafka.svc:9093
  config:
    config.storage.replication.factor: 1
    offset.storage.replication.factor: 1
    status.storage.replication.factor: 1
  image: strimzi/kafka:0.30.0-kafka-3.2.0
  replicas: 1
  template:
    deployment:
      deploymentStrategy: Recreate
  tls:
    trustedCertificates: []
  version: 3.2.0
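As a quick sanity check in an environment where the resource above is deployed (assuming the cluster name kc and the <cluster>-connect-api naming visible in the logs below), one can verify whether the REST-API service exists and whether the API server defaulted the field:
kubectl get svc kc-connect-api
kubectl get svc kc-connect-api \
  -o jsonpath='{.spec.internalTrafficPolicy}{"\n"}'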
I can try to reproduce the issue in isolation or provide more details upon request.
Expected behavior
Strimzi treats the server-defaulted /spec/internalTrafficPolicy the same way it already treats /spec/clusterIP and /spec/clusterIPs (see the “Ignoring Service … diff” log lines below): it ignores the default value and does not attempt to patch the service.
Environment (please complete the following information):
- Strimzi version: 0.30.0
- Installation method: YAML files
- Kubernetes cluster: Kubernetes 1.23
- Infrastructure: AWS EC2, kOps
YAML files and logs
2022-10-05 20:56:07 DEBUG AbstractResourceOperator:115 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Service acme/acme-connect-api already exists, patching it
2022-10-05 20:56:07 DEBUG ResourceDiff:33 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/metadata/managedFields"}
2022-10-05 20:56:07 DEBUG ResourceDiff:33 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/spec/clusterIP"}
2022-10-05 20:56:07 DEBUG ResourceDiff:33 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Ignoring Service acme-connect-api diff {"op":"remove","path":"/spec/clusterIPs"}
2022-10-05 20:56:07 DEBUG ResourceDiff:38 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Service acme-connect-api differs: {"op":"remove","path":"/spec/internalTrafficPolicy"}
2022-10-05 20:56:07 DEBUG ResourceDiff:39 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Current Service acme-connect-api path /spec/internalTrafficPolicy has value "Cluster"
2022-10-05 20:56:07 DEBUG ResourceDiff:40 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Desired Service acme-connect-api path /spec/internalTrafficPolicy has value
2022-10-05 20:56:07 DEBUG AbstractResourceOperator:249 - Reconciliation #12897(timer) KafkaConnect(acme/acme): Caught exception while patching Service acme-connect-api in namespace acme
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://100.64.0.1/api/v1/namespaces/acme/services/acme-connect-api. Message: the server rejected our request due to an error in our request. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[], group=null, kind=null, name=null, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=the server rejected our request due to an error in our request, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handlePatch(OperationSupport.java:411) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handlePatch(OperationSupport.java:372) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handlePatch(BaseOperation.java:654) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$patch$3(HasMetadataOperation.java:243) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.patch(HasMetadataOperation.java:248) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.patch(HasMetadataOperation.java:258) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.patch(HasMetadataOperation.java:43) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.fabric8.kubernetes.client.dsl.Patchable.patch(Patchable.java:35) ~[io.fabric8.kubernetes-client-5.12.2.jar:?]
at io.strimzi.operator.common.operator.resource.AbstractResourceOperator.internalPatch(AbstractResourceOperator.java:245) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
at io.strimzi.operator.common.operator.resource.AbstractResourceOperator.internalPatch(AbstractResourceOperator.java:239) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
at io.strimzi.operator.common.operator.resource.ServiceOperator.internalPatch(ServiceOperator.java:95) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
at io.strimzi.operator.common.operator.resource.ServiceOperator.internalPatch(ServiceOperator.java:29) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
at io.strimzi.operator.common.operator.resource.AbstractResourceOperator.lambda$reconcile$0(AbstractResourceOperator.java:116) ~[io.strimzi.operator-common-0.30.0.jar:0.30.0]
at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159) ~[io.vertx.vertx-core-4.2.4.jar:4.2.4]
at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100) ~[io.vertx.vertx-core-4.2.4.jar:4.2.4]
at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157) ~[io.vertx.vertx-core-4.2.4.jar:4.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.77.Final.jar:4.1.77.Final]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Additional context
Applying the same patch to the same resource via kubectl results in a 200 response (not a 422):
kubectl -v6 \
patch svc acme-connect-api \
--type=json \
-p='[{"op":"remove","path":"/spec/internalTrafficPolicy"}]'
...
I1006 22:12:29.502240 73865 round_trippers.go:553] PATCH https://api.example.com/api/v1/namespaces/acme/services/acme-connect-api?fieldManager=kubectl-patch 200 OK in 223 milliseconds
service/acme-connect-api patched
But the patch still should not be attempted in the first place, because it does not lead to any change: after the field is removed, the API server immediately defaults it back to Cluster:
kubectl \
get svc acme-connect-api \
-o jsonpath="{.spec.internalTrafficPolicy}"
Cluster
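A server-side dry run should show the same defaulting behaviour: applying a manifest that omits the field should come back with Cluster already filled in, so there is nothing for the operator to reconcile. (connect-api-service.yaml is a hypothetical file here, standing in for the operator’s desired Service without internalTrafficPolicy.)
kubectl apply --dry-run=server \
  -f connect-api-service.yaml \
  -o jsonpath='{.spec.internalTrafficPolicy}{"\n"}'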
Top GitHub Comments
Thanks for the feedback on this. We are planning to do 0.32 next week, so you should be able to move to that once it is out.
@scholzj we got the dev Cluster Operator deployed to a dev environment and I can confirm that the unexpected reconciliations don’t happen there. Here’s an excerpt from the logs:
Unfortunately, the deployment took quite a while, and I lost the logs of the previous operator version in the same environment to compare them side-by-side. I cannot quickly roll back to the previous version either because it requires deploying the old cluster-level resources, and this process is insufficiently automated on our end.
To the best of my knowledge, we did have “Failure executing: PATCH” errors there unrelated to /spec/internalTrafficPolicy, and I no longer see them. Would you consider tagging a stable release so that we could roll it out to all environments? In case it doesn’t help and we still see the failures, I’ll do more troubleshooting on our side and file a new issue with more details.