Broker CPU utilization underestimated on Kubernetes
Underlying Problem
The method getProcessCpuLoad() [1] [2], which the Cruise Control Metric Reporter uses to gather CPU utilization, is NOT cgroup-aware. This causes the Cruise Control Metric Reporter to underestimate the CPU utilization of the Kafka brokers.
No matter what Kubernetes resource restrictions are in place, the metric reporter will return:
CPU Utilization = ((allocated container cores) * (container CPU utilization)) / (cores on physical host)
For example, if you set a 2 core limit on a broker pod that is scheduled to a physical node with 8 cores and max out the CPU of the broker, the reported CPU utilization will be:
0.25 = ((2 cores) * (1.0 utilization)) / (8 cores on physical host)
When the CPU utilization should be:
1.0
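For reference, the number comes from com.sun.management.OperatingSystemMXBean#getProcessCpuLoad() [2], which reports the JVM's CPU usage relative to all CPUs visible to the OS. A minimal, illustrative sketch of reading it (not the actual Metric Reporter code):

import java.lang.management.ManagementFactory;

// Minimal illustration of where the node-relative number comes from.
// Not the actual Cruise Control Metric Reporter code.
public class ReportedCpuLoad {
  public static void main(String[] args) throws InterruptedException {
    com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
    // The first sample may be 0.0 or negative; wait briefly and read again.
    os.getProcessCpuLoad();
    Thread.sleep(1000);
    // Inside a pod limited to 2 cores on an 8-core node, a saturated broker
    // JVM prints roughly 0.25 here rather than 1.0.
    System.out.println(os.getProcessCpuLoad());
  }
}

Because the denominator is the node's full core count, the cgroup quota never enters the calculation.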
Rebalance issues tied to CPU resource underestimation
This causes problems in the following situations:
Kubernetes CPU resource limits
Although the brokers' CPU resources will still be properly restricted by K8s, the metric reporter will underestimate the utilization of the CPU resources that have been allocated. This makes brokers appear to have more CPU resources available than they actually do.
+-----------+
|    B0     |
|  2 cores  |   K8s CPU limit: 1 core
+-----------+
   Node 0
The metric reporter will show a CPU utilization of 50% for Broker0 (B0) even when Broker0 is really using 100% of its K8s-allocated CPU core. This could cause a rebalance operation to assign more partitions to the maxed-out Broker0.
+-----------+                                   +-----------+
|    B0     |   move partitions from B1 to B0   |    B1     |
|  2 cores  |  <------------------------------  |           |
+-----------+                                   +-----------+
   Node 0                                          Node 1
B0 is using 100% of the CPU resources allocated to it by K8s and has no CPU capacity left, but the metric reporter reports that B0 is only using 50% of its CPU resources because it thinks that all of the node's CPU resources are available to B0.
One broker per node
Even if we put only one broker per node, the reported CPU utilization would only be correct if there were no K8s CPU limits and no other applications running on the same node. Even then, the estimated load of a broker on a node with multiple cores would not be weighted any differently than that of a broker on a node with one core, so it would be possible to overload a broker when moving partitions from a node with more cores to a node with fewer cores.
+-----------+                                   +-----------+
|           |  move load from Node 1 to Node 2  |           |
|  4 cores  |  ------------------------------>  |  1 core   |
+-----------+                                   +-----------+
   Node 1                                          Node 2
  CPU 100%                                         CPU 0%

+-----------+                                   +-----------+
|           |     move 2 cores worth of work    |           |
|  4 cores  |  ------------------------------>  |  1 core   |
+-----------+                                   +-----------+
   Node 1                                          Node 2
   CPU 50%                                         CPU 200%
We could get around this issue by adding broker-specific CPU capacity entries to the Cruise Control capacity configuration to account for the weight, but it would require tracking the nodes that brokers get scheduled on, getting the number of CPU cores on each of those nodes, and updating the broker-specific CPU capacity entries accordingly.
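For illustration, such entries might look like the following, assuming the brokerCapacities layout of Cruise Control's capacity configuration file and treating the CPU capacity value as a per-broker weight. The numbers are made up: broker 0 sits on a 2-core node, broker 1 on an 8-core node.

{
  "brokerCapacities": [
    {
      "brokerId": "0",
      "capacity": { "DISK": "100000", "CPU": "200", "NW_IN": "10000", "NW_OUT": "10000" },
      "doc": "Broker 0 is currently scheduled on a 2-core node"
    },
    {
      "brokerId": "1",
      "capacity": { "DISK": "100000", "CPU": "800", "NW_IN": "10000", "NW_OUT": "10000" },
      "doc": "Broker 1 is currently scheduled on an 8-core node"
    }
  ]
}

Every time a pod lands on a node with a different core count, this file would need to be regenerated, which is the bookkeeping burden described above.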
Multiple brokers per node
Even when a node is using 100% of its CPU resources, if there is more than one broker on that node, the metric reporter for each broker on that node will report a CPU utilization value that is less than 100%. This gives the appearance that these brokers have more CPU resources than they actually have.
+-----------+                                   +-----------+
|           |   move load from B0 to B1 and B2  |    B1     |
|    B0     |  ------------------------------>  |    B2     |
+-----------+                                   +-----------+
   Node 1                                          Node 2
  CPU 100%                                         CPU 100%
Broker0(B0): CPU 100%                       Broker1(B1): CPU 50%
                                            Broker2(B2): CPU 50%
In its cluster model, Cruise Control tracks and aggregates the broker load on hosts using hostnames [2]. On bare metal this works fine, since the hostname corresponds to the underlying node, but on K8s the hostname corresponds to the name of the broker's pod, so more than one pod could be scheduled on the same physical host. One way to solve this issue would be to alter the Cruise Control metric reporter to query the node names of the broker pods from the K8s API and then update the CC cluster model accordingly.
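As a rough sketch of that idea, the reporter (or the load monitor) could resolve the pod's spec.nodeName through the Kubernetes API. The example below uses the Fabric8 client purely for illustration; the class and helper name are hypothetical, and the issue does not prescribe any particular client library.

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Hypothetical helper: map a broker pod to the physical node it runs on.
// Assumes the pod exists and the client has permission to read pods.
public class PodNodeResolver {
  public static String nodeNameFor(String namespace, String podName) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // spec.nodeName is set by the scheduler to the node hosting the pod.
      return client.pods()
                   .inNamespace(namespace)
                   .withName(podName)
                   .get()
                   .getSpec()
                   .getNodeName();
    }
  }
}

The returned node name could then replace the pod hostname when the cluster model aggregates broker load per host. Alternatively, the same value can be injected into the pod as an environment variable via the Kubernetes downward API (fieldRef: spec.nodeName), which avoids giving broker pods direct API access.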
Potential solution
One potential solution to the issues above would be to allow the Cruise Control Metric Reporter to be configured to get the CPU utilization of a JVM process with a method that is aware of container boundaries. Right now, the metric reporter uses getProcessCpuLoad() [2], which gets the CPU usage of the JVM with respect to the physical node. There have been recent efforts to update these functions to be aware of their operating environment, whether that is a physical host or a container, but this specific method has not been updated.
The best approach I have found so far is to still use getProcessCpuLoad() and scale it by the ratio of the host's core count to the container's cgroup CPU quota, for example:
CPU util = getProcessCpuLoad() * ((number of physical cores) / (cgroup share quota))
We could then have a Cruise Control Metric reporter configuration option that would allow this function to be used in place of the original when operating in Kubernetes.
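A minimal sketch of that calculation, assuming cgroup v1 quota files and that Runtime.availableProcessors() still reflects the host's core count (on container-aware JVMs it may need to be read from /proc/cpuinfo instead); this is an illustrative proof of concept, not the actual reporter change:

import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical cgroup-aware variant of the CPU utilization calculation.
public class ContainerCpuUtil {

  private static final String QUOTA_FILE = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us";
  private static final String PERIOD_FILE = "/sys/fs/cgroup/cpu/cpu.cfs_period_us";

  public static double containerCpuUtil() throws IOException {
    com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    // Node-relative utilization, i.e. what the metric reporter records today.
    double processCpuLoad = os.getProcessCpuLoad();

    long quotaUs = readLong(QUOTA_FILE);    // -1 means no CPU limit is set
    long periodUs = readLong(PERIOD_FILE);
    if (quotaUs <= 0 || periodUs <= 0) {
      return processCpuLoad;                // no cgroup quota: the node-relative value is already correct
    }

    // Assumption: this reflects the host's core count in the environments we care about.
    int hostCores = Runtime.getRuntime().availableProcessors();
    double allowedCores = (double) quotaUs / periodUs;

    // Rescale from "fraction of the whole node" to "fraction of the container's quota".
    return Math.min(1.0, processCpuLoad * (hostCores / allowedCores));
  }

  private static long readLong(String path) throws IOException {
    return Long.parseLong(Files.readAllLines(Paths.get(path)).get(0).trim());
  }
}

Clamping to 1.0 keeps momentary scheduler bursts above the quota from producing utilizations over 100%.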
[1] https://github.com/linkedin/cruise-control/blob/6448a828f90b0d391dbe3176f8462a0c8dbf2700/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricsUtils.java#L168
[2] https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad--
Top GitHub Comments
Exactly, the problem here is that Kubernetes pods are agnostic of the physical node they reside on. The node.host() method used to build the cluster model [1] returns the hostname of the pod, not the hostname of the physical node that the pod is scheduled on. This causes the cluster model to think that the broker pods are on their own physical nodes even though they may not be, so the CPU utilization values reported by the metric reporters are never aggregated by physical host in the cluster model. To address this issue in the cluster model, we would need to make calls to the Kubernetes API here [1] using the hostname of a broker pod to retrieve the hostname of the physical node that the pod is scheduled on, and use that information to populate the node.host() information of a broker in the cluster model. We could address the issue this way, but I think it would be a little messier than the fix for the metric reporter.

The problem with this approach is that no matter how we resolve the capacities for the brokers, the metric reporters are always going to report CPU utilization values with respect to the CPU resources available on the physical host that the broker pods are scheduled on. As soon as another broker pod or any other application pod is scheduled to the same physical host as the original broker pod, the CPU utilization values will be underestimated and will not be trustworthy. Of course, we could restrict a physical host to only allow the hosting of a single broker pod, but that would remove the resource utilization benefits of running on Kubernetes in the first place!

As stated above, we could fix this in the cluster model by leveraging the Kubernetes API when building the cluster model, but I think it would be a lot cleaner to fix it in the metrics reporter. It would follow the Kubernetes model of treating pods as hosts and abstracting the physical hosts from users. Let me know what you think!
[1] https://github.com/linkedin/cruise-control/blob/f522db3f37cff1a98d5fc42ae2e36fbb230a0ad2/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/monitor/LoadMonitor.java#L500
Although this change would be cleaner, the getSystemCpuLoad() and getCpuLoad() methods are not patched for OpenJDK versions < 14, so the solution would not work when running OpenJDK versions < 14. I have been playing with a solution for OpenJDK versions >= 8 that uses the getProcessCpuLoad() method, the host's core count, and the container's cgroup share quota. It's a trickier solution, but it is effective for the earlier OpenJDK versions as well! I'll put a proof of concept together and ping you for review.