[Question] Cruise control seems to timeout on a rebalance
When I run a rebalance on my cluster, it rebalances for a few minutes and then appears to hang.
It doesn't transfer any more data, and Kafka is left in a state where I can't move any more partitions. If I delete the rebalance and try again, the cluster reports that Kafka is already working on a job. I can't find any evidence that Cruise Control is actually doing anything, and it has been stuck in this state for a day.
The total amount of data that it is going to transfer is not that much.
Also, is there any way to show a plan of what cruise control intends to do?
Environment: Strimzi 0.19 on Amazon EKS, 25 brokers, 5 ZooKeeper nodes, probably 13 servers.
Deployed rebalance:
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: kafka-cluster
spec: {}
Rebalance state:
Status:
  Conditions:
    Last Transition Time: 2020-11-10T19:54:24.276387Z
    Status: True
    Type: Rebalancing
  Observed Generation: 1
  Optimization Result:
    Data To Move MB: 273150
    Excluded Brokers For Leadership:
    Excluded Brokers For Replica Move:
    Excluded Topics:
    Intra Broker Data To Move MB: 0
    Monitored Partitions Percentage: 100
    Num Intra Broker Replica Movements: 0
    Num Leader Movements: 9
    Num Replica Movements: 165
    On Demand Balancedness Score After: 76.09682904474765
    On Demand Balancedness Score Before: 50.89423465367239
    Recent Windows: 1
  Session Id: 03559c06-4520-47bd-829b-167a358e6
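On the question of seeing what Cruise Control intends to do: the Optimization Result in the status above is the summary of the proposal that was computed. A minimal Python sketch for pulling the numbers out of that text follows. It assumes the status has been saved as plain "Key: value" lines; the function names parse_optimization_result and summarize are hypothetical helpers for illustration, not part of Strimzi or Cruise Control.

```python
import re

# Values copied from the status in this issue; the layout is assumed to be
# plain "Key: value" lines, as printed by a describe-style command.
STATUS_TEXT = """\
Optimization Result:
  Data To Move MB: 273150
  Num Leader Movements: 9
  Num Replica Movements: 165
  On Demand Balancedness Score After: 76.09682904474765
  On Demand Balancedness Score Before: 50.89423465367239
"""

# Matches lines whose value is a single plain number (timestamps, IDs and
# empty values are skipped because they do not fit this pattern).
NUMERIC_LINE = re.compile(r"\s*([A-Za-z ]+):\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_optimization_result(text):
    """Return a dict of every 'Key: <number>' line found in the text."""
    result = {}
    for line in text.splitlines():
        match = NUMERIC_LINE.match(line)
        if match:
            result[match.group(1).strip()] = float(match.group(2))
    return result


def summarize(result):
    """Derive a few rough figures from the parsed proposal summary."""
    data_mb = result["Data To Move MB"]
    moves = result["Num Replica Movements"]
    return {
        "data_to_move_mb": data_mb,
        # Average only; individual partitions can be far larger or smaller.
        "avg_mb_per_replica_move": data_mb / moves if moves else 0.0,
        "score_gain": (
            result["On Demand Balancedness Score After"]
            - result["On Demand Balancedness Score Before"]
        ),
    }


if __name__ == "__main__":
    for key, value in summarize(parse_optimization_result(STATUS_TEXT)).items():
        print(f"{key}: {value:.2f}")
```

This only summarizes the proposal; it does not list the individual partition movements, which are not part of the status shown here.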
Cruise control settings:
STRIMZI_KAFKA_BOOTSTRAP_SERVERS: kafka-cluster-kafka-bootstrap:9091
STRIMZI_KAFKA_GC_LOG_ENABLED: false
MIN_INSYNC_REPLICAS: 2
BROKER_DISK_MIB_CAPACITY: 512000.0
BROKER_CPU_UTILIZATION_CAPACITY: 100
BROKER_INBOUND_NETWORK_KIB_PER_SECOND_CAPACITY: 6103515.625
BROKER_OUTBOUND_NETWORK_KIB_PER_SECOND_CAPACITY: 6103515.625
KAFKA_HEAP_OPTS: -Xms128M
CRUISE_CONTROL_CONFIGURATION: num.partition.metrics.windows=1
completed.user.task.retention.time.ms=86400000
num.broker.metrics.windows=20
hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
broker.metrics.window.ms=300000
default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal
partition.metrics.window.ms=300000
goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal
Issue Analytics
- State:
- Created 3 years ago
- Comments:10 (6 by maintainers)
Top GitHub Comments
A similar issue involving a stuck rebalance was raised by other Kafka and Cruise Control users [1] [2] for Kafka 2.4+. It was related to the use of deprecated partition reassignment commands. Cruise Control has since migrated to the supported AdminClient API [3] for replica reassignment, which should address the rebalancing issue. The patched version of Cruise Control is included in Strimzi 0.20, so I would recommend upgrading to Strimzi 0.20 to see whether the issue persists.
[1] https://issues.apache.org/jira/browse/KAFKA-9478
[2] https://github.com/linkedin/cruise-control/issues/1167
[3] https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
Let us know how you get on. Strimzi 0.25 uses Cruise Control 2.5.57, which contains some significant performance improvements.