CruiseControl CPU Utilization Computation

Describe the bug

When trying to rebalance with Cruise Control, we always get an OptimizationFailureException saying that the CpuCapacityGoal cannot be satisfied and that we should add more brokers. The utilization computation seems off. The Cruise Control REST API call to /kafkacruisecontrol/load shows that it assumes our brokers have 1 core, which may be the cause of the miscalculation.

In Cruise Control you can set num.cores for broker capacity (https://github.com/linkedin/cruise-control/blob/migrate_to_kafka_2_4/config/capacityCores.json), but in Strimzi this is not possible (https://strimzi.io/docs/operators/latest/using.html#type-CruiseControlSpec-reference).

To Reproduce

Steps to reproduce the behavior:

  1. Add cruiseControl to the Kafka CR: cruiseControl: {}
  2. Create a KafkaRebalance with spec: {}
  3. Wait to see Status: Error for request: cluster-main-cruise-control.kafka-devl.svc:9090/kafkacruisecontrol/rebalance?json=true&dryrun=true&verbose=true&skip_hard_goal_check=false. Server returned: Error processing POST request '/rebalance' due to: 'com.linkedin.kafka.cruisecontrol.exception.OptimizationFailureException: [CpuCapacityGoal] Insufficient capacity for cpu (Utilization 858.79, Allowed Capacity 600.00, Threshold: 1.00). Add at least 3 brokers with the same cpu capacity (100.00) as broker-0. Add at least 3 brokers with the same cpu capacity (100.00) as broker-0.'.

Expected behavior

Expect to see a “ProposalReady” status in the KafkaRebalance CR, something like:

  Status:
    Conditions:
      Last Transition Time: 2020-05-19T13:50:12.533Z
      Status: ProposalReady
      Type: State
    Observed Generation: 1
    Optimization Result:
      Data To Move MB: 0
      Excluded Brokers For Leadership:
      Excluded Brokers For Replica Move:
      Excluded Topics:
      Intra Broker Data To Move MB: 0
      Monitored Partitions Percentage: 100
      Num Intra Broker Replica Movements: 0
      Num Leader Movements: 0
      Num Replica Movements: 26
      On Demand Balancedness Score After: 81.8666802863978
      On Demand Balancedness Score Before: 78.01176356230222
      Recent Windows: 1
    Session Id: 05539377-ca7b-45ef-b359-e13564f1458c

Environment (please complete the following information):

  • Strimzi version: AMQ Streams 1.8 (Strimzi 0.24)
  • Installation method: YAML Files
  • Kubernetes cluster: OpenShift 4.8.13 (Kubernetes v1.21.1)
  • Infrastructure: OpenShift on AWS EC2

YAML files and logs

Kafka CR (attached in the zip file)

KafkaRebalance CR (attached in the zip file)

REST call to /kafkacruisecontrol/load:

  curl https://cluster-main-cruise-control-kafka-devl.apps.ocp4-prod1.helvetia.io/kafkacruisecontrol/load

  HOST  BROKER  RACK  DISK_CAP(MB)  DISK(MB)/_(%)_  CORE_NUM  CPU(%)  NW_IN_CAP(KB/s)  LEADER_NW_IN(KB/s)  FOLLOWER_NW_IN(KB/s)  NW_OUT_CAP(KB/s)  NW_OUT(KB/s)  PNW_OUT(KB/s)  LEADERS/REPLICAS
  cluster-main-kafka-0.cluster-main-kafka-brokers.kafka-devl.svc, 0,eu-central-1c, 512000.000, 22101.055/04.32, 1, 113.978, 10000.000, 2.142, 0.518, 10000.000, 5.632, 7.537, 725/1809
  cluster-main-kafka-1.cluster-main-kafka-brokers.kafka-devl.svc, 1,eu-central-1a, 512000.000, 23908.867/04.67, 1, 198.530, 10000.000, 2.703, 8.209, 10000.000, 6.910, 142.473, 747/1999
  cluster-main-kafka-2.cluster-main-kafka-brokers.kafka-devl.svc, 2,eu-central-1b, 512000.000, 23526.426/04.60, 1, 72.533, 10000.000, 6.253, 4.443, 10000.000, 13.036, 23.018, 693/2006
  cluster-main-kafka-3.cluster-main-kafka-brokers.kafka-devl.svc, 3,eu-central-1b, 512000.000, 17625.527/03.44, 1, 35.741, 10000.000, 6.081, 1.061, 10000.000, 123.725, 245.161, 736/1879
  cluster-main-kafka-4.cluster-main-kafka-brokers.kafka-devl.svc, 4,eu-central-1c, 512000.000, 17014.631/03.32, 1, 272.399, 10000.000, 0.509, 9.539, 10000.000, 238.128, 371.253, 643/1797
  cluster-main-kafka-5.cluster-main-kafka-brokers.kafka-devl.svc, 5,eu-central-1a, 512000.000, 17739.572/03.46, 1, 189.939, 10000.000, 0.825, 0.846, 10000.000, 2.193, 5.035, 683/1734
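To make the mismatch in this output easier to see, here is a small sketch that parses the six broker rows and compares the reported CPU(%) values against a capacity of 100 per reported core. The column positions are an assumption read off the header line, and the helper names (parse_load, cpu_summary) are made up for this illustration; they are not part of Cruise Control.

```python
# Rows copied from the /kafkacruisecontrol/load output in the issue.
LOAD_ROWS = """\
cluster-main-kafka-0.cluster-main-kafka-brokers.kafka-devl.svc, 0,eu-central-1c, 512000.000, 22101.055/04.32, 1, 113.978, 10000.000, 2.142, 0.518, 10000.000, 5.632, 7.537, 725/1809
cluster-main-kafka-1.cluster-main-kafka-brokers.kafka-devl.svc, 1,eu-central-1a, 512000.000, 23908.867/04.67, 1, 198.530, 10000.000, 2.703, 8.209, 10000.000, 6.910, 142.473, 747/1999
cluster-main-kafka-2.cluster-main-kafka-brokers.kafka-devl.svc, 2,eu-central-1b, 512000.000, 23526.426/04.60, 1, 72.533, 10000.000, 6.253, 4.443, 10000.000, 13.036, 23.018, 693/2006
cluster-main-kafka-3.cluster-main-kafka-brokers.kafka-devl.svc, 3,eu-central-1b, 512000.000, 17625.527/03.44, 1, 35.741, 10000.000, 6.081, 1.061, 10000.000, 123.725, 245.161, 736/1879
cluster-main-kafka-4.cluster-main-kafka-brokers.kafka-devl.svc, 4,eu-central-1c, 512000.000, 17014.631/03.32, 1, 272.399, 10000.000, 0.509, 9.539, 10000.000, 238.128, 371.253, 643/1797
cluster-main-kafka-5.cluster-main-kafka-brokers.kafka-devl.svc, 5,eu-central-1a, 512000.000, 17739.572/03.46, 1, 189.939, 10000.000, 0.825, 0.846, 10000.000, 2.193, 5.035, 683/1734
"""


def parse_load(text):
    """Return {broker_id: (core_num, cpu_percent)} from comma-separated rows."""
    brokers = {}
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        # Assumed column order: HOST, BROKER, RACK, DISK_CAP, DISK, CORE_NUM, CPU, ...
        brokers[int(fields[1])] = (int(fields[5]), float(fields[6]))
    return brokers


def cpu_summary(brokers):
    """Sum utilization and capacity, and list brokers above their own capacity."""
    total = sum(cpu for _, cpu in brokers.values())
    capacity = sum(100.0 * cores for cores, _ in brokers.values())
    over = sorted(b for b, (cores, cpu) in brokers.items() if cpu > 100.0 * cores)
    return total, capacity, over


brokers = parse_load(LOAD_ROWS)
total, capacity, over = cpu_summary(brokers)
print(f"total CPU {total:.2f} vs capacity {capacity:.2f}; over capacity: {over}")
```

The snapshot total will not match the 858.79 in the error message exactly, presumably because the two were sampled at different times, but it shows the same pattern: several brokers report more than 100% on a single reported core.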

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 17 (16 by maintainers)

Top GitHub Comments

2 reactions
fvaleri commented, Nov 30, 2021

Thanks @kyguy, I’ll continue my analysis as soon as I have some time.

In the meantime, as a workaround, we can exclude CPU goals from the preset hard/default goals like this:

  cruiseControl:
    config:
      hard.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.MinTopicLeadersPerBrokerGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
      default.goals: >
        com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.MinTopicLeadersPerBrokerGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,
        com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal

This works in my test env with all brokers at around 150% CPU utilization.

1 reaction
kyguy commented, Dec 1, 2021

There are two issues going on here:

(1) Strimzi Cruise Control always displays 1 CPU per broker

This is strictly a UI issue (it does not cause the rebalance errors) but still needs to be fixed. I can confirm that Strimzi Cruise Control is configured to only ever display 1 CPU core per broker. This single “virtual” CPU core has the cycles of 0 or more CPU cores (however many cores are configured in spec.kafka.resources.limits.cpu). It has no effect on the rebalance behavior, since the ratio between utilization and capacity remains the same whether we specify the cores explicitly or not. I do agree that the display is unintuitive and incorrect, and that it should be fixed!

When we configure numCores correctly, this error:

Insufficient capacity for cpu (Utilization 858.79, Allowed Capacity 600.00, Threshold: 1.00).

will change to this error:

Insufficient capacity for cpu (Utilization 1717.58, Allowed Capacity 1200.00, Threshold: 1.00).

This is because the utilization and the capacity values are both multiplied by the numCores. This leads us to the second issue:
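As a quick check of that arithmetic, the sketch below scales both numbers from the first error message by a core count and compares the ratios. The factor of 2 is inferred from the two error messages quoted above (1717.58 / 858.79), and the function name is made up for illustration.

```python
def scale_by_cores(utilization, capacity, num_cores):
    """Multiply utilization and capacity by the same core count."""
    return utilization * num_cores, capacity * num_cores


# Values from the first error message (one "virtual" core per broker).
util_1, cap_1 = 858.79, 600.00

# Core count inferred from the second error message; an assumption.
util_n, cap_n = scale_by_cores(util_1, cap_1, 2)

ratio_before = util_1 / cap_1
ratio_after = util_n / cap_n
print(f"{util_n:.2f} / {cap_n:.2f}: ratio {ratio_after:.4f} (was {ratio_before:.4f})")
```

Because the ratio does not change, the goal fails either way, which is why fixing the displayed core count alone would not resolve the rebalance error.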

(2) The CPU utilization value is incorrect

This is causing the rebalance errors. It was introduced with the OpenJDK 11.0.13 (2021-10-19 LTS) package, which changed the behavior [1] of getProcessCpuLoad() [2], the method Cruise Control relies on to measure broker CPU utilization. This change causes Cruise Control to calculate CPU utilization values for a broker that are greater than the capacity values of that broker! [3] This is incorrect behavior!

The breaking changes to getProcessCpuLoad() were released in OpenJDK 11.0.13 (2021-10-19 LTS). This package shipped with AMQ Streams 1.8 but not with any released Strimzi version <= 0.26, so the CPU utilization problem is isolated to AMQ Streams 1.8 for now. It does not affect any released versions of Strimzi <= 0.26, but we need a patch to avoid this issue on Strimzi 0.27. Luckily, the fix for this issue simply involves setting METRICS_REPORTER_KUBERNETES_MODE to false here [4].
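One way to picture how a utilization above capacity could arise is the toy model below. It is not Cruise Control's code. It assumes, as my reading of the linked JDK bug, that the newer JVM already reports load relative to the container's CPU limit, and that the metrics reporter in Kubernetes mode still rescales a value it expects to be relative to the whole node. All names and numbers here are hypothetical.

```python
def reported_load(cores_used, node_cores, container_cores,
                  jvm_container_aware, kubernetes_mode):
    """Toy model of the CPU load a reporter would publish (1.0 == 100%)."""
    # What the JVM call returns in this model: a fraction of the node,
    # or a fraction of the container limit if the JVM is container-aware.
    base = container_cores if jvm_container_aware else node_cores
    jvm_value = cores_used / base
    if not kubernetes_mode:
        return jvm_value
    # Hypothetical rescaling from node-relative to container-relative load.
    return jvm_value * node_cores / container_cores


# Hypothetical broker: uses 1.5 cores, limit of 2 cores, on an 8-core node.
old_jvm = reported_load(1.5, 8, 2, jvm_container_aware=False, kubernetes_mode=True)
new_jvm = reported_load(1.5, 8, 2, jvm_container_aware=True, kubernetes_mode=True)
patched = reported_load(1.5, 8, 2, jvm_container_aware=True, kubernetes_mode=False)
print(old_jvm, new_jvm, patched)
```

In this model the rescaling is applied to a value that was already container-relative, so the result overshoots 100%, and turning Kubernetes mode off brings it back to the expected value. Whether the real code path behaves exactly this way should be checked against references [1] to [4].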

Anyways, @fvaleri I can confirm both issues for AMQ Streams 1.8 but Strimzi should be free of issue (2) and rebalance without error! Regardless, we will get these issues patched for Strimzi!

[1] https://bugs.openjdk.java.net/browse/JDK-8269851
[2] https://github.com/linkedin/cruise-control/blob/4e5927b48bf2581ab76acbbecbf42b355b871b65/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricsUtils.java#L406
[3] https://github.com/linkedin/cruise-control/blob/4e5927b48bf2581ab76acbbecbf42b355b871b65/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/metric/ContainerMetricUtils.java#L90-L108
[4] https://github.com/strimzi/strimzi-kafka-operator/blob/d440baad0f7e12738af073fb30b30e9a1ae589f2/cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaBrokerConfigurationBuilder.java#L96
