Broker CPU utilization underestimated on Kubernetes
Underlying Problem
The method getProcessCpuLoad() [1] [2], which the Cruise Control Metric Reporter uses to gather CPU utilization, is NOT cgroup-aware. This causes the Cruise Control Metric Reporter to underestimate the CPU utilization of the Kafka brokers.
No matter what Kubernetes resource restrictions are in place, the metric reporter will return:
CPU Utilization = ((allocated container cores) * (container CPU utilization)) / (cores on physical host)
For example, if you set a 2 core limit on a broker pod that is scheduled to a physical node with 8 cores and max out the CPU of the broker, the reported CPU utilization will be:
0.25 = ((2 cores) * (1.0 utilization)) / (8 cores on physical host)
When the CPU utilization should be:
1.0
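For reference, the number comes from com.sun.management.OperatingSystemMXBean#getProcessCpuLoad() [2], which reports the JVM's CPU usage relative to all CPUs visible to the OS. A minimal, illustrative sketch of reading it (not the actual Metric Reporter code):

import java.lang.management.ManagementFactory;

// Minimal illustration of where the node-relative number comes from.
// Not the actual Cruise Control Metric Reporter code.
public class ReportedCpuLoad {
  public static void main(String[] args) throws InterruptedException {
    com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
    // The first sample may be 0.0 or negative; wait briefly and read again.
    os.getProcessCpuLoad();
    Thread.sleep(1000);
    // Inside a pod limited to 2 cores on an 8-core node, a saturated broker
    // JVM prints roughly 0.25 here rather than 1.0.
    System.out.println(os.getProcessCpuLoad());
  }
}

Because the denominator is the node's full core count, the cgroup quota never enters the calculation.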
Rebalance issues tied to CPU resource underestimation
This causes problems in the following situations:
Kubernetes CPU resource limits
Although the brokers' CPU resources will still be properly restricted by K8s, the metric reporter will underestimate the utilization of the CPU resources that have been allocated. This makes brokers appear to have more CPU resources available than they actually do.
+-----------+
|    B0     |
|  2 cores  |   K8s CPU limit: 1 core
+-----------+
   Node 0
The metric reporter will show a CPU utilization of 50% for Broker0 (B0) even when Broker0 is really using 100% of its K8s-allocated CPU core. This could cause a rebalance operation to assign more partitions to the maxed-out Broker0.
+-----------+                                   +-----------+
|    B0     |   move partitions from B1 to B0   |    B1     |
|  2 cores  |  <------------------------------  |           |
+-----------+                                   +-----------+
   Node 0                                          Node 1
B0 is using 100% of the CPU resources allocated to it by K8s and has no CPU capacity left, but the metric reporter reports that B0 is only using 50% of its CPU resources because it thinks that all of the node's CPU resources are available to B0.
One broker per node
Even if we put only one broker per node, the reported CPU utilization would only be correct if there were no K8s CPU limits and no other applications running on the same node. Even then, the estimated load of a broker on a node with multiple cores would not be weighted any differently than that of a broker on a node with one core, so it would be possible to overload a broker when moving partitions from a node with more cores to a node with fewer cores.
+-----------+                                   +-----------+
|           |  move load from Node 1 to Node 2  |           |
|  4 cores  |  ------------------------------>  |  1 core   |
+-----------+                                   +-----------+
   Node 1                                          Node 2
  CPU 100%                                         CPU 0%

+-----------+                                   +-----------+
|           |     move 2 cores worth of work    |           |
|  4 cores  |  ------------------------------>  |  1 core   |
+-----------+                                   +-----------+
   Node 1                                          Node 2
   CPU 50%                                         CPU 200%
We could get around this issue by adding broker-specific CPU capacity entries to the Cruise Control capacity configuration to account for the weight, but it would require tracking the nodes that brokers get scheduled on, getting the number of CPU cores on each of those nodes, and updating the broker-specific CPU capacity entries accordingly.
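For illustration, such entries might look like the following, assuming the brokerCapacities layout of Cruise Control's capacity configuration file and treating the CPU capacity value as a per-broker weight. The numbers are made up: broker 0 sits on a 2-core node, broker 1 on an 8-core node.

{
  "brokerCapacities": [
    {
      "brokerId": "0",
      "capacity": { "DISK": "100000", "CPU": "200", "NW_IN": "10000", "NW_OUT": "10000" },
      "doc": "Broker 0 is currently scheduled on a 2-core node"
    },
    {
      "brokerId": "1",
      "capacity": { "DISK": "100000", "CPU": "800", "NW_IN": "10000", "NW_OUT": "10000" },
      "doc": "Broker 1 is currently scheduled on an 8-core node"
    }
  ]
}

Every time a pod lands on a node with a different core count, this file would need to be regenerated, which is the bookkeeping burden described above.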
Multiple brokers per node
Even when a node is using 100% of its CPU resources, if there is more than one broker on that node, the metric reporter for each broker on that node will report a CPU utilization value that is less than 100%. This gives the appearance that these brokers have more CPU resources than they actually have.
+-----------+                                   +-----------+
|           |   move load from B0 to B1 and B2  |    B1     |
|    B0     |  ------------------------------>  |    B2     |
+-----------+                                   +-----------+
   Node 1                                          Node 2
  CPU 100%                                         CPU 100%
Broker0(B0): CPU 100%                       Broker1(B1): CPU 50%
                                            Broker2(B2): CPU 50%
In its cluster model, Cruise Control tracks and aggregates the broker load on hosts using hostnames [2]. On bare metal this works fine, since the hostname corresponds to the underlying node, but on K8s the hostname corresponds to the name of the broker's pod, so more than one pod could be scheduled on the same physical host. One way to solve this issue would be to alter the Cruise Control metric reporter to query the node names of the broker pods from the K8s API and then update the CC cluster model accordingly.
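As a rough sketch of that idea, the reporter (or the load monitor) could resolve the pod's spec.nodeName through the Kubernetes API. The example below uses the Fabric8 client purely for illustration; the class and helper name are hypothetical, and the issue does not prescribe any particular client library.

import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

// Hypothetical helper: map a broker pod to the physical node it runs on.
// Assumes the pod exists and the client has permission to read pods.
public class PodNodeResolver {
  public static String nodeNameFor(String namespace, String podName) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      // spec.nodeName is set by the scheduler to the node hosting the pod.
      return client.pods()
                   .inNamespace(namespace)
                   .withName(podName)
                   .get()
                   .getSpec()
                   .getNodeName();
    }
  }
}

The returned node name could then replace the pod hostname when the cluster model aggregates broker load per host. Alternatively, the same value can be injected into the pod as an environment variable via the Kubernetes downward API (fieldRef: spec.nodeName), which avoids giving broker pods direct API access.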
Potential solution
One potential solution to the issues above would be to allow the Cruise Control Metric Reporter to be configured to get the CPU utilization of a JVM process with a method that is aware of container boundaries. Right now, the metric reporter uses getProcessCpuLoad() [2], which gets the CPU usage of the JVM with respect to the physical node. There have been recent efforts to update these functions to be aware of their operating environment, whether that is a physical host or a container, but this specific method has not been updated.
The best approach I have found so far is to still use getProcessCpuLoad() and scale it by the ratio of the host's core count to the container's cgroup CPU quota, for example:
CPU util = getProcessCpuLoad() * ((number of physical cores) / (cgroup share quota))
We could then have a Cruise Control Metric reporter configuration option that would allow this function to be used in place of the original when operating in Kubernetes.
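A minimal sketch of that calculation, assuming cgroup v1 quota files and that Runtime.availableProcessors() still reflects the host's core count (on container-aware JVMs it may need to be read from /proc/cpuinfo instead); this is an illustrative proof of concept, not the actual reporter change:

import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical cgroup-aware variant of the CPU utilization calculation.
public class ContainerCpuUtil {

  private static final String QUOTA_FILE = "/sys/fs/cgroup/cpu/cpu.cfs_quota_us";
  private static final String PERIOD_FILE = "/sys/fs/cgroup/cpu/cpu.cfs_period_us";

  public static double containerCpuUtil() throws IOException {
    com.sun.management.OperatingSystemMXBean os =
        (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    // Node-relative utilization, i.e. what the metric reporter records today.
    double processCpuLoad = os.getProcessCpuLoad();

    long quotaUs = readLong(QUOTA_FILE);    // -1 means no CPU limit is set
    long periodUs = readLong(PERIOD_FILE);
    if (quotaUs <= 0 || periodUs <= 0) {
      return processCpuLoad;                // no cgroup quota: the node-relative value is already correct
    }

    // Assumption: this reflects the host's core count in the environments we care about.
    int hostCores = Runtime.getRuntime().availableProcessors();
    double allowedCores = (double) quotaUs / periodUs;

    // Rescale from "fraction of the whole node" to "fraction of the container's quota".
    return Math.min(1.0, processCpuLoad * (hostCores / allowedCores));
  }

  private static long readLong(String path) throws IOException {
    return Long.parseLong(Files.readAllLines(Paths.get(path)).get(0).trim());
  }
}

Clamping to 1.0 keeps momentary scheduler bursts above the quota from producing utilizations over 100%.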
[1] https://github.com/linkedin/cruise-control/blob/6448a828f90b0d391dbe3176f8462a0c8dbf2700/cruise-control-metrics-reporter/src/main/java/com/linkedin/kafka/cruisecontrol/metricsreporter/metric/MetricsUtils.java#L168
[2] https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/OperatingSystemMXBean.html#getProcessCpuLoad--
Top GitHub Comments
Exactly, the problem here is that Kubernetes pods are agnostic of the physical node they reside on. The node.host() method used to build the cluster model [1] returns the hostname of the pod, not the hostname of the physical node that the pod is scheduled on. This causes the cluster model to think that the broker pods are on their own physical nodes even though they may not be, so the CPU utilization values reported by the metric reporters are never aggregated by physical host in the cluster model. To address this issue in the cluster model, we would need to make calls to the Kubernetes API here [1] using the hostname of a broker pod to retrieve the hostname of the physical node that the pod is scheduled on, and use that information to populate the node.host() information of a broker in the cluster model. We could address the issue this way, but I think it would be a little messier than the fix for the metric reporter.

The problem with this approach is that no matter how we resolve the capacities for the brokers, the metric reporters are always going to report CPU utilization values with respect to the CPU resources available on the physical host that the broker pods are scheduled on. As soon as another broker pod or any other application pod is scheduled to the same physical host as the original broker pod, the CPU utilization values will be underestimated and will not be trustworthy. Of course, we could restrict a physical host to only allow the hosting of a single broker pod, but that would remove the resource utilization benefits of running on Kubernetes in the first place!

As stated above, we could fix this in the cluster model by leveraging the Kubernetes API when building the cluster model, but I think it would be a lot cleaner to fix it in the metrics reporter. It would follow the Kubernetes model of treating pods as hosts and abstracting the physical hosts from users. Let me know what you think!
[1] https://github.com/linkedin/cruise-control/blob/f522db3f37cff1a98d5fc42ae2e36fbb230a0ad2/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/monitor/LoadMonitor.java#L500
Although this change would be cleaner, the getSystemCpuLoad() and getCpuLoad() methods are not patched for OpenJDK versions < 14, so the solution would not work when running OpenJDK versions < 14. I have been playing with a solution for OpenJDK versions >= 8 that uses the getProcessCpuLoad() method, the host's core count, and the container's cgroup share quota. It's a trickier solution, but it is effective for the earlier OpenJDK versions as well! I'll put a proof of concept together and ping you for review.