Cruise control not auto rebalancing after first rebalance
See original GitHub issueCruise Control version: v0.1.10
I am testing scaling up Kafka clusters running on kubernetes using cruise control. As part of scaling up, I am creating new Kafka broker pods and then relying on cruise-control to trigger a cluster rebalance to assign partitions to the newly added brokers as per this suggestion by @efeg: https://github.com/linkedin/cruise-control/issues/344#issuecomment-426094872.
I also found a similar issue (https://github.com/linkedin/cruise-control/issues/906) to this one where the ReplicaDistributionGoal
was missing in the anomaly.detection.goals
. Based on this, I have added the ReplicaDistributionGoal
to my anomaly.detection.goals
config and I am relying on cluster rebalancing to be triggered because of this goal failing when new brokers are added.
However, I noticed that cruise control only triggered a cluster rebalance and assigned partitions to the newly created brokers ONCE. In order to be sure that auto-rebalancing is working, I scaled down the cluster and then added brokers back to the cluster. But this time cruise control did not trigger a cluster rebalance, partitions were not assigned to the new brokers and the new brokers remained idle until I manually triggered a cluster rebalance.
I am attaching some screenshots from the cc-ui for reference.
Brokers 4 and 5 were the new brokers that were launched. It seems like only newly created topics were assigned to these brokers and the existing data was not rebalanced and existing partitions were not assigned to these brokers.
Cruise Control Monitor State:
Cruise Control Analyzer State: Note that the ReplicaDistributionGoal
seems to be ready to be analyzed.
Anomaly Detector State: Notice that the ReplicaDistributionGoal
violation was detected only ONCE - this was when I added a broker earlier. But when I added more again, this was never detected and self-healing did not get triggered.
Cruise Control Proposals: It seems like cruise control thinks that it has already fixed the ReplicaDistributionGoal
violation? However, as the first screenshot shows, self-healing did not get triggered and the cluster did not auto-rebalance. The cruise control logs also did not show any log line for triggering self-healing.
Configurations
These are the relevant configurations that I have setup cruise control with:
self.healing.enabled=true
default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal
goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.kafkaassigner.KafkaAssignerDiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.kafkaassigner.KafkaAssignerEvenRackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal
hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal
anomaly.detection.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuCapacityGoal
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
@efeg that seems to have fixed the problem. Thanks for your help!
@mhaseebmlk To add on https://github.com/linkedin/cruise-control/issues/1270#issuecomment-659609862, note that for REST API requests, you can explicitly set
exclude_recently_removed_brokers=false
to avoid excluding recently removed brokers. The current default value (i.e. value used if missing) for this parameter istrue
(see code).