sdn-controller crash with a high number of services and SCK
What happened:
We deployed SCK 1.4 into our OpenShift 4.x cluster. On an empty cluster, or one with few pods running, it appears to work fine, but as we increase the number of pods and services, SCK starts to affect overall cluster health and destabilizes the cluster. Specifically, it overwhelms the OpenShift API, causing the sdn-controller to crash and recycle almost constantly. We even end up having worker nodes go down because of this. We were able to tie the problem directly to SCK, and it occurred on different 4.4.x and 4.5.x versions of OpenShift. Since undeploying SCK we have not had a single hiccup and the cluster has been entirely stable.
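A quick way to observe the symptom (a minimal sketch; the `app=sdn-controller` label selector is an assumption, adjust for your cluster):

```sh
# Watch the sdn-controller pods in the openshift-sdn namespace; while SCK is
# deployed the RESTARTS column climbs steadily.
oc get pods -n openshift-sdn -l app=sdn-controller -w

# Check recent events in the namespace for crash-loop and eviction messages.
oc get events -n openshift-sdn --sort-by=.lastTimestamp | tail -n 20
```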
What you expected to happen: The sdn-controller shouldn’t crash and the cluster shouldn’t have any problems running pods.
How to reproduce it (as minimally and precisely as possible): Deploy SCK in its own namespace on a cluster that has at least 70 pods/services running.
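For context, a minimal deployment sketch; the release name, namespace, chart version, and HEC values are placeholders, and the Helm repo URL is the upstream default, so adjust all of these for your environment:

```sh
# Create a dedicated namespace/project for SCK.
oc new-project splunk-connect

# Add the upstream chart repo and install SCK 1.4 with placeholder HEC settings.
helm repo add splunk https://splunk.github.io/splunk-connect-for-kubernetes/
helm install splunk-connect splunk/splunk-connect-for-kubernetes \
  --namespace splunk-connect \
  --version 1.4.0 \
  --set global.splunk.hec.host=splunk-hec.example.com \
  --set global.splunk.hec.token=<HEC_TOKEN>
```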
Anything else we need to know?:
Environment:

- `oc version`: Client Version: 4.3.1, Server Version: 4.5.3, Kubernetes Version: v1.18.3+3107688
- OS (e.g. `cat /etc/os-release`): Red Hat Enterprise Linux CoreOS 45.82.202007171855-0
- Splunk version: 7.0.0
- Others:
@fshadid96 It was a bug in the metadata filter. I know you had a bad first touch, but we have had no issues since patching it, and I don't expect you to have to worry once you get updated.
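For anyone else hitting this: SCK enriches log events via fluentd's kubernetes_metadata filter, which queries (and by default watches) the Kubernetes API for pod metadata, so a misbehaving filter there can generate a large volume of API traffic. A hedged sketch of how to check your install and pick up the fix; the ConfigMap name and target chart version are placeholders:

```sh
# Find the fluentd ConfigMap that SCK generates and look at the
# kubernetes_metadata filter section, the component that calls the Kubernetes API.
oc get configmaps -n splunk-connect
oc get configmap <sck-logging-configmap> -n splunk-connect -o yaml \
  | grep -A 5 'kubernetes_metadata'

# Upgrade to a chart release that contains the patch.
helm repo update
helm upgrade splunk-connect splunk/splunk-connect-for-kubernetes \
  --namespace splunk-connect --version <patched-version>
```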
Closing. Thank you!