Cruise control doesn't auto rebalance even with appropriate goals

Cruise control version: v2.0.59

I am testing out Cruise Control and noticed that it doesn’t auto-rebalance when we scale up the brokers.

These are the steps I followed:

  1. Have a 3 node Kafka cluster with constant traffic of ~600msgs/second (just 8 topics overall)
  2. Have cruise control running as well with self.healing turned on and anomaly.notifier.class=com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
  3. Increase the number of brokers to 6

Expectation:

  • Cruise Control detects that the ‘ReplicaDistribution’ goal is violated, proposes replica movements, and executes a rebalance, so that replicas are automatically distributed among all the brokers (including the new ones).

Reality:

  • I do see proposals from Cruise Control, but unless I manually run a rebalance, the new brokers don’t get any partitions.

Optimization has 131 inter-broker replica(6689 MB) moves, 0 intra-broker replica(0 MB) moves and 36 leadership moves with a cluster model of 1 recent windows and 100.000% of the partitions covered.

Stats for ReplicaDistributionGoal(FIXED):
AVG:{cpu:       2.246 networkInbound:      25.123 networkOutbound:      25.108 disk:    2325.525 potentialNwOut:      75.306 replicas:94.33333333333333 leaderReplicas:35.0 topicReplicas:9.433333333333334}
MAX:{cpu:       3.100 networkInbound:      29.466 networkOutbound:      62.520 disk:    4651.177 potentialNwOut:      88.265 replicas:103 leaderReplicas:43 topicReplicas:33}
MIN:{cpu:       1.640 networkInbound:      20.783 networkOutbound:       4.437 disk:       0.000 potentialNwOut:      62.349 replicas:85 leaderReplicas:28 topicReplicas:0}
STD:{cpu:       0.486 networkInbound:       4.341 networkOutbound:      20.123 disk:    2325.522 potentialNwOut:      12.957 replicas:8.673074554171793 leaderReplicas:6.4807406984078595 topicReplicas:6.869221668645902}
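
As a rough sanity check on the ReplicaDistributionGoal stats above, the sketch below tests whether the reported MIN and MAX replica counts fall within a symmetric band around the average. The 1.10 threshold and the band formula are assumptions made for illustration (Cruise Control exposes a `replica.count.balance.threshold` setting, but its internal computation may differ in detail), and the function name is hypothetical.

```python
def within_balance_band(avg, lowest, highest, threshold=1.10):
    """Simplified balance check (an assumption, not Cruise Control's code).

    Treats the cluster as balanced when every broker's replica count lies
    between avg * (2 - threshold) and avg * threshold.
    """
    upper = avg * threshold
    lower = avg * (2.0 - threshold)
    return lower <= lowest and highest <= upper


if __name__ == "__main__":
    # AVG, MIN and MAX replica counts copied from the stats above.
    print(within_balance_band(94.33333333333333, 85, 103))
    # Hypothetical pre-rebalance cluster: new brokers hold no replicas.
    print(within_balance_band(94.33333333333333, 0, 189))
```

Under these assumptions, the proposed distribution sits inside the band, while a hypothetical cluster whose new brokers hold zero replicas falls outside it.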

But the executor does nothing:

ExecutorState: {state: NO_TASK_IN_PROGRESS}
AnalyzerState: {isProposalReady: true, readyGoals: [NetworkInboundUsageDistributionGoal, CpuUsageDistributionGoal, PotentialNwOutGoal, NetworkInboundCapacityGoal, LeaderBytesInDistributionGoal, DiskCapacityGoal, ReplicaDistributionGoal, RackAwareGoal, TopicReplicaDistributionGoal, NetworkOutboundCapacityGoal, CpuCapacityGoal, DiskUsageDistributionGoal, NetworkOutboundUsageDistributionGoal, ReplicaCapacityGoal]}
AnomalyDetectorState: {selfHealingEnabled:[BROKER_FAILURE, DISK_FAILURE, GOAL_VIOLATION, METRIC_ANOMALY], selfHealingDisabled:[], selfHealingEnabledRatio:{BROKER_FAILURE=1.0, DISK_FAILURE=1.0, GOAL_VIOLATION=1.0, METRIC_ANOMALY=1.0}, recentGoalViolations:[], recentBrokerFailures:[], recentMetricAnomalies:[], recentDiskFailures:[], metrics:{meanTimeBetweenAnomalies:{GOAL_VIOLATION:0.00 milliseconds, BROKER_FAILURE:0.00 milliseconds, METRIC_ANOMALY:0.00 milliseconds}, meanTimeToStartFix:0.00 milliseconds, numSelfHealingStarted:0, ongoingAnomalyDuration=0.00 milliseconds}, ongoingSelfHealingAnomaly:None}
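
The state output above can also be inspected programmatically. The following is a minimal sketch, assuming the Cruise Control REST `state` endpoint is reachable and that its JSON response uses the same field names as the text output shown here (that mapping is an assumption). The base URL and both helper names are placeholders introduced for illustration.

```python
from urllib.parse import urlencode


def state_url(base_url):
    """Build a URL for the state endpoint, restricted to two substates."""
    query = urlencode(
        {"substates": "ANOMALY_DETECTOR,EXECUTOR", "json": "true"}, safe=","
    )
    return f"{base_url.rstrip('/')}/kafkacruisecontrol/state?{query}"


def summarize_self_healing(state):
    """Extract the fields that show whether self-healing ever acted.

    `state` is the parsed JSON response; the key names are assumed to
    mirror the text output in the issue.
    """
    detector = state.get("AnomalyDetectorState", {})
    executor = state.get("ExecutorState", {})
    return {
        "goal_violation_healing_enabled": "GOAL_VIOLATION"
        in detector.get("selfHealingEnabled", []),
        "recent_goal_violations": len(detector.get("recentGoalViolations", [])),
        "self_healing_started": detector.get("metrics", {}).get(
            "numSelfHealingStarted", 0
        ),
        "executor_state": executor.get("state"),
    }


if __name__ == "__main__":
    # Placeholder host and port; adjust to the actual deployment.
    print(state_url("http://localhost:9090"))
    # Sample dict built by hand from the values reported in the issue.
    sample = {
        "ExecutorState": {"state": "NO_TASK_IN_PROGRESS"},
        "AnomalyDetectorState": {
            "selfHealingEnabled": ["BROKER_FAILURE", "GOAL_VIOLATION"],
            "recentGoalViolations": [],
            "metrics": {"numSelfHealingStarted": 0},
        },
    }
    print(summarize_self_healing(sample))
```

For the values reported here, such a summary would show self-healing enabled for goal violations while no goal violation was ever recorded, which points at detection rather than execution.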

Questions:

  • Is this expected? Is cruise control expected to auto-rebalance with self-healing turned on?
  • Will self-healing turn on only when the brokers fail?
  • Is there any other config knob that is missing which will help cluster auto-scale?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

efeg commented, Jul 16, 2020 (1 reaction)

mhaseebmlk commented, Jul 16, 2020 (0 reactions)

@aravindvs were you able to get Cruise Control to trigger an automatic rebalance every time brokers are added while scaling up? I found this issue, added ReplicaDistributionGoal to self.anomaly.detector.goals, and noticed that Cruise Control triggered a cluster rebalance once. I then scaled the cluster down and back up again; this time, however, Cruise Control did not trigger an automatic rebalance, and the two new brokers did not get any partitions assigned to them.

I am wondering whether this is an issue with Cruise Control or something wrong with my configuration. We would expect automatic rebalancing to be triggered every time the ReplicaDistributionGoal is violated, but it looks like Cruise Control does not consider the goal violated in this case.
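
One way to rule out a configuration problem is to check whether ReplicaDistributionGoal actually appears in the goal list that the anomaly detector evaluates. The sketch below parses a properties file and looks for the goal under a given key. The key name is passed in as a parameter because the comment above spells it `self.anomaly.detector.goals`, whereas the upstream configuration reference, as far as I know, calls it `anomaly.detection.goals`. The sample file contents and the helper names are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines (no line continuations or ':' separators)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def goal_listed(props, key, goal_simple_name):
    """Return True if a goal with this simple class name is listed under `key`."""
    entries = [g.strip() for g in props.get(key, "").split(",") if g.strip()]
    return any(e.rsplit(".", 1)[-1] == goal_simple_name for e in entries)


if __name__ == "__main__":
    # Hypothetical excerpt of a cruisecontrol.properties file.
    sample = """
    # self-healing settings
    self.healing.enabled=true
    anomaly.notifier.class=com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier
    anomaly.detection.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal
    """
    props = parse_properties(sample)
    print(goal_listed(props, "anomaly.detection.goals", "ReplicaDistributionGoal"))
```

If the goal is missing from that list, the detector has no reason to report a violation after brokers are added, even though manual proposals still include it.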
