CruiseControl CPU Utilization Computation
Describe the bug
When trying to do a rebalance with Cruise Control, we always get an OptimizationFailureException saying that the CpuCapacityGoal cannot be satisfied and that we should add more brokers. The utilization computation seems off. The Cruise Control REST API call to /kafkacruisecontrol/load shows that it assumes our brokers have 1 core, which may be the cause of the miscalculation.
In Cruise Control you can set num.cores for the broker capacity (https://github.com/linkedin/cruise-control/blob/migrate_to_kafka_2_4/config/capacityCores.json), but in Strimzi this is not possible (https://strimzi.io/docs/operators/latest/using.html#type-CruiseControlSpec-reference).
To Reproduce
Steps to reproduce the behavior:
- Add cruiseControl to the Kafka CR: `cruiseControl: {}`
- Create a KafkaRebalance with `spec: {}`
- Wait to see the following Status:
Error for request: cluster-main-cruise-control.kafka-devl.svc:9090/kafkacruisecontrol/rebalance?json=true&dryrun=true&verbose=true&skip_hard_goal_check=false. Server returned: Error processing POST request '/rebalance' due to: 'com.linkedin.kafka.cruisecontrol.exception.OptimizationFailureException: [CpuCapacityGoal] Insufficient capacity for cpu (Utilization 858.79, Allowed Capacity 600.00, Threshold: 1.00). Add at least 3 brokers with the same cpu capacity (100.00) as broker-0. Add at least 3 brokers with the same cpu capacity (100.00) as broker-0.'.
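The figures in this message can be checked by hand. The sketch below assumes that the allowed capacity is simply brokers × per-broker capacity × threshold, and that the suggested broker count is the shortfall divided by one broker's capacity, rounded up. Both assumptions are consistent with the numbers in the message but are not taken from the Cruise Control source, and `extra_brokers_needed` is a hypothetical helper name.

```python
import math


def extra_brokers_needed(utilization, num_brokers, broker_capacity, threshold=1.0):
    """Estimate how many extra brokers of equal capacity would cover the load.

    Assumption: allowed capacity = num_brokers * broker_capacity * threshold.
    """
    allowed = num_brokers * broker_capacity * threshold
    shortfall = utilization - allowed
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / (broker_capacity * threshold))


# Values from the error message: 6 brokers at capacity 100.00, threshold 1.00.
allowed_capacity = 6 * 100.00 * 1.00
suggested = extra_brokers_needed(858.79, 6, 100.00, 1.00)

print(allowed_capacity)  # 600.0, matching "Allowed Capacity 600.00"
print(suggested)         # 3, matching "Add at least 3 brokers"
```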
Expected behavior
Expect to see a “ProposalReady” Status in the KafkaRebalance CR, something like:
Status:
  Conditions:
    Last Transition Time:  2020-05-19T13:50:12.533Z
    Status:                ProposalReady
    Type:                  State
  Observed Generation:     1
  Optimization Result:
    Data To Move MB:  0
    Excluded Brokers For Leadership:
    Excluded Brokers For Replica Move:
    Excluded Topics:
    Intra Broker Data To Move MB:         0
    Monitored Partitions Percentage:      100
    Num Intra Broker Replica Movements:   0
    Num Leader Movements:                 0
    Num Replica Movements:                26
    On Demand Balancedness Score After:   81.8666802863978
    On Demand Balancedness Score Before:  78.01176356230222
    Recent Windows:                       1
  Session Id:                             05539377-ca7b-45ef-b359-e13564f1458c
Environment (please complete the following information):
- Strimzi version: AMQ Streams 1.8 (Strimzi 0.24)
- Installation method: YAML Files
- Kubernetes cluster: OpenShift 4.8.13 (Kubernetes v1.21.1)
- Infrastructure: OpenShift on AWS EC2
YAML files and logs
Kafka CR (attached in the zip file)
KafkaRebalance CR (attached in the zip file)
REST call to /kafkacruisecontrol/load:
`curl https://cluster-main-cruise-control-kafka-devl.apps.ocp4-prod1.helvetia.io/kafkacruisecontrol/load`
HOST BROKER RACK DISK_CAP(MB) DISK(MB)/_(%)_ CORE_NUM CPU(%) NW_IN_CAP(KB/s) LEADER_NW_IN(KB/s) FOLLOWER_NW_IN(KB/s) NW_OUT_CAP(KB/s) NW_OUT(KB/s) PNW_OUT(KB/s) LEADERS/REPLICAS
cluster-main-kafka-0.cluster-main-kafka-brokers.kafka-devl.svc, 0,eu-central-1c, 512000.000, 22101.055/04.32, 1, 113.978, 10000.000, 2.142, 0.518, 10000.000, 5.632, 7.537, 725/1809
cluster-main-kafka-1.cluster-main-kafka-brokers.kafka-devl.svc, 1,eu-central-1a, 512000.000, 23908.867/04.67, 1, 198.530, 10000.000, 2.703, 8.209, 10000.000, 6.910, 142.473, 747/1999
cluster-main-kafka-2.cluster-main-kafka-brokers.kafka-devl.svc, 2,eu-central-1b, 512000.000, 23526.426/04.60, 1, 72.533, 10000.000, 6.253, 4.443, 10000.000, 13.036, 23.018, 693/2006
cluster-main-kafka-3.cluster-main-kafka-brokers.kafka-devl.svc, 3,eu-central-1b, 512000.000, 17625.527/03.44, 1, 35.741, 10000.000, 6.081, 1.061, 10000.000, 123.725, 245.161, 736/1879
cluster-main-kafka-4.cluster-main-kafka-brokers.kafka-devl.svc, 4,eu-central-1c, 512000.000, 17014.631/03.32, 1, 272.399, 10000.000, 0.509, 9.539, 10000.000, 238.128, 371.253, 643/1797
cluster-main-kafka-5.cluster-main-kafka-brokers.kafka-devl.svc, 5,eu-central-1a, 512000.000, 17739.572/03.46, 1, 189.939, 10000.000, 0.825, 0.846, 10000.000, 2.193, 5.035, 683/1734
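As a quick sanity check on the output above, the CPU(%) column can be summed and compared against the reported capacity. The sketch below copies the six CPU(%) values by hand and assumes each broker's capacity is 100% (one core, as shown in CORE_NUM). The total differs from the 858.79 in the error message; the two figures were not necessarily sampled at the same moment.

```python
# CPU(%) per broker ID, copied from the /load output above.
cpu_percent = {
    0: 113.978,
    1: 198.530,
    2: 72.533,
    3: 35.741,
    4: 272.399,
    5: 189.939,
}

# Assumption: one reported core per broker corresponds to a capacity of 100%.
capacity_per_broker = 100.0

total_cpu = sum(cpu_percent.values())
total_capacity = capacity_per_broker * len(cpu_percent)
over_capacity = sorted(b for b, cpu in cpu_percent.items() if cpu > capacity_per_broker)

print(f"total CPU: {total_cpu:.3f} of {total_capacity:.1f}")
print("brokers above their own capacity:", over_capacity)
```

Four of the six brokers report more CPU than a single broker is assumed to have, which is the symptom the CpuCapacityGoal rejects.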
Top GitHub Comments
Thanks @kyguy, I’ll continue my analysis as soon as I have some time.
In the meantime, as a workaround, we can exclude CPU goals from the preset hard/default goals like this:
This works in my test env with all brokers at around 150% CPU utilization.
There are two issues going on here:
(1) Strimzi Cruise Control always displays 1 CPU per broker
This is strictly a UI issue (it does not cause the rebalance errors) but still needs to be fixed. I can confirm that Strimzi Cruise Control is configured to only ever display 1 CPU core per broker. This single “virtual” CPU core has the cycles of 0 or more CPU cores (however many cores are configured in `spec.kafka.resources.limits.cpu`). But it has no effect on the rebalance behavior, since the ratio between the utilization and the capacity remains the same whether we specify the cores explicitly or not. But I do agree that the display is unintuitive, incorrect, and should be fixed!
When we configure the numCore correctly, this error:
will change to this error:
This is because the utilization and the capacity values are both multiplied by the numCores. This leads us to the second issue:
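To make the scaling argument in (1) concrete, here is a minimal sketch. It assumes, as described above, that configuring the core count multiplies utilization and capacity by the same factor. The helper `utilization_ratio` and the core count of 4 are hypothetical and used only for illustration.

```python
import math


def utilization_ratio(utilization, capacity, num_cores=1):
    """Ratio of utilization to capacity when both are scaled by num_cores."""
    return (utilization * num_cores) / (capacity * num_cores)


# Figures from the original error message (1 "virtual" core per broker).
ratio_one_core = utilization_ratio(858.79, 600.00, num_cores=1)

# Hypothetical: the same cluster with 4 cores configured per broker.
ratio_four_cores = utilization_ratio(858.79, 600.00, num_cores=4)

print(math.isclose(ratio_one_core, ratio_four_cores))  # True
print(ratio_one_core > 1.0)                            # True: still over capacity
```

The ratio stays above 1 either way, so the goal fails regardless of the displayed core count.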
(2) The CPU utilization value is incorrect
This is causing the rebalance errors. This was introduced with the `openjdk 11.0.13 2021-10-19 LTS` package, where the behavior of `getProcessCpuLoad()` [2], the method Cruise Control relies on to measure broker CPU utilization, has changed [1]. This change causes Cruise Control to calculate CPU utilization values for a broker that are greater than the capacity values of that broker [3]. This is incorrect behavior!
The breaking changes of `getProcessCpuLoad()` were released in `openjdk 11.0.13 2021-10-19 LTS`. This package was shipped with `AMQ Streams 1.8` but not in any released Strimzi versions `<= 0.26`. So this CPU utilization problem is isolated to `AMQ Streams 1.8` for now. It does not affect any released versions of `Strimzi <= 0.26`, but we need a patch to avoid this issue on `Strimzi 0.27`. Luckily, the fix for this issue simply involves setting `METRICS_REPORTER_KUBERNETES_MODE` to `false` here [4].
Anyways, @fvaleri, I can confirm both issues for AMQ Streams 1.8, but Strimzi should be free of issue (2) and rebalance without error! Regardless, we will get these issues patched for Strimzi!
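One way to picture how a utilization value can exceed capacity is a double normalization. The sketch below is an illustration under two stated assumptions, neither of which is spelled out in the comment above: that the Kubernetes mode rescales a host-relative CPU load to the container's CPU limit, and that after the JDK change [1] the load is already relative to the container. The function `reported_cpu_load` and all numbers are hypothetical and are not the Cruise Control implementation; see [3] for the real code.

```python
def reported_cpu_load(process_cpu_load, host_cores, cpu_limit_cores, kubernetes_mode):
    """Hypothetical model of the value the metrics reporter would publish.

    Assumption: in Kubernetes mode the load is treated as host-relative
    and rescaled to the container's CPU limit.
    """
    if not kubernetes_mode:
        return process_cpu_load
    return process_cpu_load * host_cores / cpu_limit_cores


HOST_CORES = 8    # hypothetical node size
CPU_LIMIT = 2     # hypothetical container CPU limit, in cores
CORES_IN_USE = 1  # the broker keeps one core busy

host_relative = CORES_IN_USE / HOST_CORES      # 0.125, load relative to the node
container_relative = CORES_IN_USE / CPU_LIMIT  # 0.5, load relative to the limit

# Host-relative input, rescaled once: a sensible 50% of the limit.
before_change = reported_cpu_load(host_relative, HOST_CORES, CPU_LIMIT, True)

# Container-relative input, rescaled again: 200% of the limit.
after_change = reported_cpu_load(container_relative, HOST_CORES, CPU_LIMIT, True)

# With the Kubernetes mode switched off, the value passes through unchanged.
with_fix = reported_cpu_load(container_relative, HOST_CORES, CPU_LIMIT, False)

print(before_change, after_change, with_fix)  # 0.5 2.0 0.5
```

Under these assumptions, turning the mode off is enough to bring the reported value back below capacity, which matches the direction of the fix described above.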
[1] https://bugs.openjdk.java.net/browse/JDK-8269851
[2] https://github.com/linkedin/cruise-control/blob/4e5927b48bf2581ab76acbbecbf42b355b871b65/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricsUtils.java#L406
[3] https://github.com/linkedin/cruise-control/blob/4e5927b48bf2581ab76acbbecbf42b355b871b65/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/metric/ContainerMetricUtils.java#L90-L108
[4] https://github.com/strimzi/strimzi-kafka-operator/blob/d440baad0f7e12738af073fb30b30e9a1ae589f2/cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaBrokerConfigurationBuilder.java#L96