[Question] Cruise Control seems to time out on a rebalance

When I run a rebalance on my cluster, it rebalances for a few minutes and then appears to hang.

It doesn't transfer any more data, and Kafka is left in a state where I can't move any more partitions. If I remove the rebalance and try again, the cluster reports that Kafka is already working on a job. I can't find any evidence that Cruise Control is actually doing anything, and it has been stuck in this state for a day.

The total amount of data that it is going to transfer is not that much.

Also, is there any way to show a plan of what Cruise Control intends to do?

Strimzi 0.19
Amazon EKS
25 brokers
5 ZooKeeper
Probably 13 servers

Deployed rebalance:

apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: kafka-cluster
spec: {}

Rebalance state:

Status:
  Conditions:
    Last Transition Time:  2020-11-10T19:54:24.276387Z
    Status:                True
    Type:                  Rebalancing
  Observed Generation:     1
  Optimization Result:
    Data To Move MB:  273150
    Excluded Brokers For Leadership:
    Excluded Brokers For Replica Move:
    Excluded Topics:
    Intra Broker Data To Move MB:         0
    Monitored Partitions Percentage:      100
    Num Intra Broker Replica Movements:   0
    Num Leader Movements:                 9
    Num Replica Movements:                165
    On Demand Balancedness Score After:   76.09682904474765
    On Demand Balancedness Score Before:  50.89423465367239
    Recent Windows:                       1
  Session Id:                             03559c06-4520-47bd-829b-167a358e6

Cruise control settings:

      STRIMZI_KAFKA_BOOTSTRAP_SERVERS:                  kafka-cluster-kafka-bootstrap:9091
      STRIMZI_KAFKA_GC_LOG_ENABLED:                     false
      MIN_INSYNC_REPLICAS:                              2
      BROKER_DISK_MIB_CAPACITY:                         512000.0
      BROKER_CPU_UTILIZATION_CAPACITY:                  100
      BROKER_INBOUND_NETWORK_KIB_PER_SECOND_CAPACITY:   6103515.625
      BROKER_OUTBOUND_NETWORK_KIB_PER_SECOND_CAPACITY:  6103515.625
      KAFKA_HEAP_OPTS:                                  -Xms128M
      CRUISE_CONTROL_CONFIGURATION:                     num.partition.metrics.windows=1
                                                        completed.user.task.retention.time.ms=86400000
                                                        num.broker.metrics.windows=20
                                                        hard.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal
                                                        broker.metrics.window.ms=300000
                                                        default.goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal
                                                        partition.metrics.window.ms=300000
                                                        goals=com.linkedin.kafka.cruisecontrol.analyzer.goals.RackAwareGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundCapacityGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.ReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PotentialNwOutGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.DiskUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkInboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.NetworkOutboundUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.CpuUsageDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.TopicReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderReplicaDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.LeaderBytesInDistributionGoal,com.linkedin.kafka.cruisecontrol.analyzer.goals.PreferredLeaderElectionGoal

cruise.zip
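
As a rough sanity check on the "not that much data" remark, the figures above can be turned into a back-of-envelope transfer time. This is only a sketch: it assumes the "MB" in the optimization result and the KiB/s capacity are binary units, it treats the whole move as if it went over a single link, and the 50 MB/s throttle is a hypothetical value chosen for illustration, not something taken from the issue.

```python
# Back-of-envelope transfer-time estimate for the proposed rebalance.
# The first three figures are copied from the status and settings above.
DATA_TO_MOVE_MB = 273150                   # "Data To Move MB"
NUM_REPLICA_MOVEMENTS = 165                # "Num Replica Movements"
BROKER_NETWORK_KIB_PER_SEC = 6103515.625   # BROKER_INBOUND_NETWORK_KIB_PER_SECOND_CAPACITY

# Assumption (not from the issue): a deliberately conservative total
# replication rate, used only to get a pessimistic estimate.
ASSUMED_THROTTLE_MB_PER_SEC = 50.0


def transfer_seconds(data_mb: float, rate_mb_per_sec: float) -> float:
    """Return the time in seconds to move data_mb at a constant rate."""
    if rate_mb_per_sec <= 0:
        raise ValueError("rate must be positive")
    return data_mb / rate_mb_per_sec


# Average size of a single replica movement.
avg_mb_per_move = DATA_TO_MOVE_MB / NUM_REPLICA_MOVEMENTS

# Configured per-broker network capacity, converted from KiB/s to MB/s.
capacity_mb_per_sec = BROKER_NETWORK_KIB_PER_SEC / 1024

# Optimistic bound: everything moves at the configured capacity of one broker link.
best_case_s = transfer_seconds(DATA_TO_MOVE_MB, capacity_mb_per_sec)

# Pessimistic bound: everything moves at the assumed throttle.
throttled_s = transfer_seconds(DATA_TO_MOVE_MB, ASSUMED_THROTTLE_MB_PER_SEC)

print(f"average per replica movement: {avg_mb_per_move:.0f} MB")
print(f"at configured capacity:       {best_case_s:.0f} s")
print(f"at assumed 50 MB/s throttle:  {throttled_s / 60:.0f} min")
```

Under these assumptions even the pessimistic figure is well under a day, which is consistent with the rebalance being stalled rather than merely slow. The estimate ignores disk throughput, concurrency limits and regular traffic, so it says nothing about how long a healthy rebalance would actually take.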

Issue Analytics

Created 3 years ago
Comments: 10 (6 by maintainers)

Top GitHub Comments

kyguy commented, Nov 12, 2020 (2 reactions)

A similar issue involving a stuck rebalance was raised by other Kafka and Cruise Control users [1] [2] for Kafka 2.4+. It was related to the use of deprecated partition reassignment commands. Cruise Control has since migrated to the supported AdminClient API [3] for replica reassignment, which should address the rebalancing issue. This patched version of Cruise Control is included in Strimzi v0.20, so I would recommend upgrading to Strimzi 0.20 to see if the issue persists!

[1] https://issues.apache.org/jira/browse/KAFKA-9478
[2] https://github.com/linkedin/cruise-control/issues/1167
[3] https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment

tomncooper commented, Sep 1, 2021 (0 reactions)

Let us know how you get on. Strimzi 0.25 uses Cruise Control 2.5.57 which contains some significant performance upgrades.
