Consumer failing to rebalance in high load situation

Description

Hello, I’m facing weird rebalance problems with the dotnet consumer. When the consumers are far behind and I scale them up, the existing instances fail to rebalance and crash. With 6 partitions and 2 consumers in the group, once I start another 2, the new ones start without problems but the existing ones crash. Since we run them in k8s, it restarts the crashed ones, which succeeds, but that in turn crashes the other ones. So we end up in an infinite crash loop that renders the whole consumer group useless. I’ve been able to rule k8s out as the cause, though (see below).

It actually happens about 4 times out of 5, so I suspect some kind of race condition.

We have auto-commit off (to ensure at-least-once delivery), and there seems to be a problem with the timing of commits. From the librdkafka logs it looks to me as if, when a rebalance starts between a message being received and its commit, the client still tries to commit before it receives the new generation id from the JoinGroup response, even though it is aware that a rebalance is in progress. But that is just my hunch; maybe this event sequence is OK.

Some facts:

  • SessionTimeout - default 45 sec
  • AutoCommit - disabled; we commit manually once a message is processed
  • FetchMaxBytes - default 52MB
  • HeartBeatInterval - default 3 sec
  • SocketTimeout - default 60 sec
  • Message size: 1KB
  • Messages behind: a few million - so I’m sure I’m getting the full FetchMaxBytes every time
  • Message process time: 100ms (without exception)
  • Nuget version: 1.8.2
  • Broker version: Running in k8s via Strimzi Operator: quay.io/strimzi/kafka:0.25.0-kafka-2.7.0
  • 3 Brokers, 3 ZKs
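
For reference, the settings in the list above can be spelled out explicitly on the consumer configuration. This is a minimal sketch, assuming the Confluent.Kafka `ConsumerConfig` property names shown here map to the settings named in the list; the values are simply the defaults quoted above, and the broker address and group id are placeholders rather than the ones from this report.

```csharp
using Confluent.Kafka;

// Sketch only: writes out the defaults quoted in the list above.
var consumerConfig = new ConsumerConfig
{
    BootstrapServers = "localhost:9092", // placeholder, not the cluster from this report
    GroupId = "example-group",           // placeholder
    EnableAutoCommit = false,            // offsets are committed manually
    SessionTimeoutMs = 45000,            // SessionTimeout, 45 sec
    HeartbeatIntervalMs = 3000,          // HeartBeatInterval, 3 sec
    FetchMaxBytes = 52428800,            // FetchMaxBytes, the "52MB" default
    SocketTimeoutMs = 60000,             // SocketTimeout, 60 sec
};
```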

What I’ve tried:

  • Fiddling with MaxMessageBytes, SessionTimeout, HeartBeatInterval - this sometimes changed the event sequence or lowered the probability of the client failing, but I’ve always been able to reproduce the failure after a few tries.
  • Downgrading to version 1.7.0
  • We observed this behaviour in a more complex app, but I’ve been able to reproduce it with a very basic client (almost the example consumer from the docs), see below

Just to mention, I started this discussion in the Strimzi repo. At first I suspected our k8s setup to be the problem, but further findings led me to this dotnet client (https://github.com/strimzi/strimzi-kafka-operator/discussions/6440)

How to reproduce

  1. There has to be a huge consumer lag on the given topic for the consumer group
  2. Start 2 instances of the consumer with the code below
  3. Start another 2 instances. The previous instances fail with the errors shown in the logs below
using System;
using System.Threading;
using System.Threading.Tasks;
using Confluent.Kafka;

namespace KafkaConsumerRebalancer
{
    class Program
    {
        static void Main(string[] args)
        {
            var consumerConfig = new ConsumerConfig()
            {
                BootstrapServers = "ish-kafka-cluster-kafka-bootstrap.kafka:9092",
                GroupId = "eta-calculator-kafka-consumer-rebalancer",
                EnableAutoCommit = false,
                SaslMechanism = SaslMechanism.ScramSha512,
                SecurityProtocol = SecurityProtocol.SaslSsl,
                SaslUsername = "xxx",
                SaslPassword = "xxx",
                EnableSslCertificateVerification = false,
                Debug = "consumer,cgrp,fetch",
            };
            
            var builder = new ConsumerBuilder<Ignore, string>(consumerConfig);
            builder.SetLogHandler((consumer, message) =>
            {
                Console.WriteLine($"{message.Level} | {message.Message}");
            });
            var consumer = builder.Build();
            
            consumer.Subscribe("ish-event.position-cat062-cartesian-changed.fdc-id.v0");

            while (true)
            {
                var consumeResult = consumer.Consume();

                Console.WriteLine($"Received Message from partition {consumeResult.Partition} with offset {consumeResult.Offset}");
                
                Thread.Sleep(100);
                
                Console.WriteLine($"Commiting message");
                consumer.Commit(consumeResult);
            }
            
            
        }
    }
}

I will include the whole log of the affected situation below, but here are a few items with my commentary, in chronological order:

2022-03-03T19:31:50.297347072Z Received Message from partition [0] with offset 2062378
2022-03-03T19:31:50.397553035Z Commiting message

Last message successfully processed

2022-03-03T19:31:50.526006147Z Debug | [thrd:main]: GroupCoordinator/0: Heartbeat for group "eta-calculator-kafka-consumer-rebalancer" generation id 102

Current generation ID: 102

2022-03-03T19:31:50.526334038Z Received Message from partition [0] with offset 2062379

New message process started

2022-03-03T19:31:50.528462135Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" heartbeat error response in state up (join-state steady, 3 partition(s) assigned): Broker: Group rebalance in progress
2022-03-03T19:31:50.528477043Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" is rebalancing (EAGER) in state up (join-state steady) with 3 assigned partition(s): rebalance in progress

Consumer noticed rebalance

2022-03-03T19:31:50.529387042Z Debug | [thrd:main]: All partitions awaiting stop are now stopped: serving assignment

All partitions unregistered

2022-03-03T19:31:50.626455406Z Commiting message

Application is done with message processing and calls for commit

2022-03-03T19:31:50.626537268Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" received op OFFSET_COMMIT in state up (join-state wait-join)
2022-03-03T19:31:50.626551785Z Debug | [thrd:main]: GroupCoordinator/0: Committing offsets for 1 partition(s) with generation-id 102 in join-state wait-join: manual

The client tries to commit even though the rebalance is still in progress (I think this is the problem)

2022-03-03T19:31:50.799984311Z Debug | [thrd:main]: JoinGroup response: GenerationId 103, Protocol range, LeaderId rdkafka-267f4f85-8fd4-44a9-aafe-d256bb55957b (me), my MemberId rdkafka-267f4f85-8fd4-44a9-aafe-d256bb55957b, member metadata count 4: (no error)

New generation ID received

2022-03-03T19:31:51.012160131Z Debug | [thrd:main]: GroupCoordinator/0: OffsetCommit for 1 partition(s) in join-state steady: manual: returned: Broker: Specified group generation id is not valid

Response to commit - failed due to old generation ID

2022-03-03T19:31:51.030272222Z Unhandled exception. Confluent.Kafka.KafkaException: Broker: Specified group generation id is not valid

Unhandled exception thrown


The event sequence is sometimes a little bit different, but in all failing cases the commit request is sent before the JoinGroup response is received.

I’m going to try turning on autocommit, and maybe other partition assignment strategies (especially the sticky ones); some of those will probably “solve” the problem. But even so, that would only obscure the problem, since I believe there is no reason why the current setup should not work.
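
The alternatives mentioned above could be configured roughly as follows. This is a sketch under assumptions, not something confirmed in this thread: it pairs auto-commit with manual offset storing (a common way to keep at-least-once delivery while letting the client decide when to send commits) and selects the cooperative sticky assignor. Whether either change avoids the crash described here has not been verified; the broker address and group id are placeholders.

```csharp
using Confluent.Kafka;

// Sketch only: auto-commit plus manual offset storing, with a sticky assignor.
var consumerConfig = new ConsumerConfig
{
    BootstrapServers = "localhost:9092", // placeholder
    GroupId = "example-group",           // placeholder
    EnableAutoCommit = true,             // the client commits stored offsets in the background
    EnableAutoOffsetStore = false,       // the application decides which offsets are stored
    PartitionAssignmentStrategy = PartitionAssignmentStrategy.CooperativeSticky,
};

// In the consume loop, after a message has been fully processed,
// call consumer.StoreOffset(consumeResult) instead of consumer.Commit(consumeResult).
```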

Thanks a lot for any ideas, Jakub

Checklist

Please provide the following information:

  • A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • Confluent.Kafka nuget version.
  • Apache Kafka version.
  • Client configuration.
  • Operating system.
  • Provide logs (with “debug” : “…” as necessary in configuration).
  • Provide broker log excerpts.
  • Critical issue.

Whole log of consumer failing:

2022-03-03T19:31:50.297347072Z Received Message from partition [0] with offset 2062378
2022-03-03T19:31:50.297613769Z Debug | [thrd:main]: Assignment dump (started_cnt=3, wait_stop_cnt=0)
2022-03-03T19:31:50.297639066Z Debug | [thrd:main]: List with 3 partition(s):
2022-03-03T19:31:50.297644867Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] offset STORED
2022-03-03T19:31:50.297649897Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] offset STORED
2022-03-03T19:31:50.297653734Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [2] offset STORED
2022-03-03T19:31:50.297658322Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.297662500Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.297959163Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.297969512Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": assignment operations done in join-state steady (rebalance rejoin=false)
2022-03-03T19:31:50.302533019Z Debug | [thrd:sasl_ssl://ish-kafka-cluster-kafka-2.ish-kafka-cluster-kafka-br]: sasl_ssl://ish-kafka-cluster-kafka-2.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/2: Enqueue 1 message(s) (1185 bytes, 1 ops) on ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] fetch queue (qlen 10032, v4, last_offset 2065588, 0 ctrl msgs, uncompressed)
2022-03-03T19:31:50.302599953Z Debug | [thrd:sasl_ssl://ish-kafka-cluster-kafka-2.ish-kafka-cluster-kafka-br]: sasl_ssl://ish-kafka-cluster-kafka-2.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/2: Fetch topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] at offset 2065589 (v4)
2022-03-03T19:31:50.302609591Z Debug | [thrd:sasl_ssl://ish-kafka-cluster-kafka-2.ish-kafka-cluster-kafka-br]: sasl_ssl://ish-kafka-cluster-kafka-2.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/2: Fetch 1/1/2 toppar(s)
2022-03-03T19:31:50.309679006Z Debug | [thrd:sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-br]: sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/1: Fetch topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] at offset 2062861 (v4)
2022-03-03T19:31:50.309694495Z Debug | [thrd:sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-br]: sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/1: Fetch 1/1/2 toppar(s)
2022-03-03T19:31:50.397553035Z Commiting message
2022-03-03T19:31:50.398124818Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" received op OFFSET_COMMIT in state up (join-state steady)
2022-03-03T19:31:50.398150185Z Debug | [thrd:main]: GroupCoordinator/0: Committing offsets for 1 partition(s) with generation-id 102 in join-state steady: manual
2022-03-03T19:31:50.521891497Z Debug | [thrd:main]: GroupCoordinator/0: OffsetCommit for 1 partition(s) in join-state steady: manual: returned: Success
2022-03-03T19:31:50.522043120Z Debug | [thrd:main]: Assignment dump (started_cnt=3, wait_stop_cnt=0)
2022-03-03T19:31:50.522208598Z Debug | [thrd:main]: List with 3 partition(s):
2022-03-03T19:31:50.522281704Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] offset STORED
2022-03-03T19:31:50.523286728Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] offset STORED
2022-03-03T19:31:50.523976043Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [2] offset STORED
2022-03-03T19:31:50.524102268Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.525164929Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.525282448Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.525729181Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": assignment operations done in join-state steady (rebalance rejoin=false)
2022-03-03T19:31:50.526006147Z Debug | [thrd:main]: GroupCoordinator/0: Heartbeat for group "eta-calculator-kafka-consumer-rebalancer" generation id 102
2022-03-03T19:31:50.526334038Z Received Message from partition [0] with offset 2062379
2022-03-03T19:31:50.528462135Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" heartbeat error response in state up (join-state steady, 3 partition(s) assigned): Broker: Group rebalance in progress
2022-03-03T19:31:50.528477043Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" is rebalancing (EAGER) in state up (join-state steady) with 3 assigned partition(s): rebalance in progress
2022-03-03T19:31:50.528484326Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state steady -> wait-unassign-call (state up)
2022-03-03T19:31:50.528489876Z Debug | [thrd:main]: Clearing current assignment of 3 partition(s)
2022-03-03T19:31:50.528608698Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state wait-unassign-call -> wait-unassign-to-complete (state up)
2022-03-03T19:31:50.528851253Z Debug | [thrd:main]: Assignment dump (started_cnt=3, wait_stop_cnt=0)
2022-03-03T19:31:50.528862624Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.528872693Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.528885527Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.528913178Z Debug | [thrd:main]: List with 3 partition(s):
2022-03-03T19:31:50.528937691Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] offset STORED
2022-03-03T19:31:50.529009428Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] offset STORED
2022-03-03T19:31:50.529027271Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [2] offset STORED
2022-03-03T19:31:50.529033052Z Debug | [thrd:main]: Removing ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] from assignment (started=true, pending=false, queried=false, stored offset=2062380)
2022-03-03T19:31:50.529136094Z Debug | [thrd:main]: Removing ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] from assignment (started=true, pending=false, queried=false, stored offset=INVALID)
2022-03-03T19:31:50.529148788Z Debug | [thrd:main]: Removing ish-event.position-cat062-cartesian-changed.fdc-id.v0 [2] from assignment (started=true, pending=false, queried=false, stored offset=INVALID)
2022-03-03T19:31:50.529161832Z Debug | [thrd:main]: Served 3 removed partition(s), with 1 offset(s) to commit
2022-03-03T19:31:50.529224869Z Debug | [thrd:main]: Current assignment of 0 partition(s) with 0 pending adds, 0 offset queries, 3 partitions awaiting stop and 0 offset commits in progress
2022-03-03T19:31:50.529234988Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": clearing group assignment
2022-03-03T19:31:50.529248874Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" received op PARTITION_LEAVE in state up (join-state wait-unassign-to-complete) for ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0]
2022-03-03T19:31:50.529324415Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": delete ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0]
2022-03-03T19:31:50.529335105Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" received op PARTITION_LEAVE in state up (join-state wait-unassign-to-complete) for ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1]
2022-03-03T19:31:50.529339894Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": delete ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1]
2022-03-03T19:31:50.529351796Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" received op PARTITION_LEAVE in state up (join-state wait-unassign-to-complete) for ish-event.position-cat062-cartesian-changed.fdc-id.v0 [2]
2022-03-03T19:31:50.529370671Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": delete ish-event.position-cat062-cartesian-changed.fdc-id.v0 [2]
2022-03-03T19:31:50.529387042Z Debug | [thrd:main]: All partitions awaiting stop are now stopped: serving assignment
2022-03-03T19:31:50.529396209Z Debug | [thrd:main]: Assignment dump (started_cnt=0, wait_stop_cnt=0)
2022-03-03T19:31:50.529401238Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.529408331Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.529475256Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.529486667Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.529511313Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": assignment operations done in join-state wait-unassign-to-complete (rebalance rejoin=false)
2022-03-03T19:31:50.529546920Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": unassign done in state up (join-state wait-unassign-to-complete)
2022-03-03T19:31:50.529553442Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": Rejoining group without an assignment: Unassignment done
2022-03-03T19:31:50.529568440Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state wait-unassign-to-complete -> init (state up)
2022-03-03T19:31:50.529591362Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": join with 1 subscribed topic(s)
2022-03-03T19:31:50.529654771Z Debug | [thrd:main]: consumer join: metadata for subscription is up to date (18420ms old)
2022-03-03T19:31:50.529661784Z Debug | [thrd:main]: sasl_ssl://ish-kafka-cluster-kafka-0.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/0: Joining group "eta-calculator-kafka-consumer-rebalancer" with 1 subscribed topic(s)
2022-03-03T19:31:50.529666813Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state init -> wait-join (state up)
2022-03-03T19:31:50.626455406Z Commiting message
2022-03-03T19:31:50.626537268Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" received op OFFSET_COMMIT in state up (join-state wait-join)
2022-03-03T19:31:50.626551785Z Debug | [thrd:main]: GroupCoordinator/0: Committing offsets for 1 partition(s) with generation-id 102 in join-state wait-join: manual
2022-03-03T19:31:50.720281826Z Debug | [thrd:sasl_ssl://ish-kafka-cluster-kafka-0.ish-kafka-cluster-kafka-br]: sasl_ssl://ish-kafka-cluster-kafka-0.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/0: Topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 [2] in state stopped at offset 2059686 (1/100000 msgs, 0/65536 kb queued, opv 4) is not fetchable: not in active fetch state
2022-03-03T19:31:50.799984311Z Debug | [thrd:main]: JoinGroup response: GenerationId 103, Protocol range, LeaderId rdkafka-267f4f85-8fd4-44a9-aafe-d256bb55957b (me), my MemberId rdkafka-267f4f85-8fd4-44a9-aafe-d256bb55957b, member metadata count 4: (no error)
2022-03-03T19:31:50.800048471Z Debug | [thrd:main]: I am elected leader for group "eta-calculator-kafka-consumer-rebalancer" with 4 member(s)
2022-03-03T19:31:50.800107240Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": resetting group leader info: JoinGroup response clean-up
2022-03-03T19:31:50.800152334Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state wait-join -> wait-metadata (state up)
2022-03-03T19:31:50.801784286Z Debug | [thrd:main]: sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/1: Group "eta-calculator-kafka-consumer-rebalancer": querying for coordinator: OffsetCommitRequest failed
2022-03-03T19:31:50.801945938Z Debug | [thrd:sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-br]: sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/1: Topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] in state stopped at offset 2059582 (1/100000 msgs, 0/65536 kb queued, opv 4) is not fetchable: not in active fetch state
2022-03-03T19:31:50.802072714Z Debug | [thrd:main]: GroupCoordinator/0: OffsetCommit for 1 partition(s) in join-state wait-metadata: manual: returned: Local: Operation in progress
2022-03-03T19:31:50.803739278Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" running range assignor for 4 member(s) and 1 eligible subscribed topic(s):
2022-03-03T19:31:50.803756510Z Debug | [thrd:main]:  Member "rdkafka-28996ae2-657a-48be-ac7a-46a1f8fc75fa" with 0 owned partition(s) and 1 subscribed topic(s):
2022-03-03T19:31:50.803762501Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [-1]
2022-03-03T19:31:50.803786175Z Debug | [thrd:main]:  Member "rdkafka-267f4f85-8fd4-44a9-aafe-d256bb55957b" (me) with 0 owned partition(s) and 1 subscribed topic(s):
2022-03-03T19:31:50.804064654Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [-1]
2022-03-03T19:31:50.804075294Z Debug | [thrd:main]:  Member "rdkafka-e5a0bd5d-7176-48ec-8f41-4c22e0cae95b" with 0 owned partition(s) and 1 subscribed topic(s):
2022-03-03T19:31:50.804079823Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [-1]
2022-03-03T19:31:50.804324629Z Debug | [thrd:main]:  Member "rdkafka-3801b6b5-0b37-4777-b9b1-23a194fd0235" with 0 owned partition(s) and 1 subscribed topic(s):
2022-03-03T19:31:50.804335168Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [-1]
2022-03-03T19:31:50.804341260Z Debug | [thrd:main]: range: Topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 with 6 partition(s) and 4 subscribing member(s)
2022-03-03T19:31:50.804345879Z Debug | [thrd:main]: range: Member "rdkafka-267f4f85-8fd4-44a9-aafe-d256bb55957b": assigned topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 partitions 0..1
2022-03-03T19:31:50.804354495Z Debug | [thrd:main]: range: Member "rdkafka-28996ae2-657a-48be-ac7a-46a1f8fc75fa": assigned topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 partitions 2..3
2022-03-03T19:31:50.804393317Z Debug | [thrd:main]: range: Member "rdkafka-3801b6b5-0b37-4777-b9b1-23a194fd0235": assigned topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 partitions 4..4
2022-03-03T19:31:50.804399398Z Debug | [thrd:main]: range: Member "rdkafka-e5a0bd5d-7176-48ec-8f41-4c22e0cae95b": assigned topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 partitions 5..5
2022-03-03T19:31:50.804422451Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" range assignment for 4 member(s) finished in 0.836ms:
2022-03-03T19:31:50.804530583Z Debug | [thrd:main]:  Member "rdkafka-28996ae2-657a-48be-ac7a-46a1f8fc75fa" assigned 2 partition(s):
2022-03-03T19:31:50.804540231Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [2]
2022-03-03T19:31:50.804544649Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [3]
2022-03-03T19:31:50.804553546Z Debug | [thrd:main]:  Member "rdkafka-267f4f85-8fd4-44a9-aafe-d256bb55957b" (me) assigned 2 partition(s):
2022-03-03T19:31:50.804562813Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0]
2022-03-03T19:31:50.804649354Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1]
2022-03-03T19:31:50.804657479Z Debug | [thrd:main]:  Member "rdkafka-e5a0bd5d-7176-48ec-8f41-4c22e0cae95b" assigned 1 partition(s):
2022-03-03T19:31:50.805292724Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [5]
2022-03-03T19:31:50.805298595Z Debug | [thrd:main]:  Member "rdkafka-3801b6b5-0b37-4777-b9b1-23a194fd0235" assigned 1 partition(s):
2022-03-03T19:31:50.805310757Z Debug | [thrd:main]:   ish-event.position-cat062-cartesian-changed.fdc-id.v0 [4]
2022-03-03T19:31:50.805314885Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": "range" assignor run for 4 member(s)
2022-03-03T19:31:50.805320605Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state wait-metadata -> wait-sync (state up)
2022-03-03T19:31:50.811124394Z Debug | [thrd:sasl_ssl://ish-kafka-cluster-kafka-2.ish-kafka-cluster-kafka-br]: sasl_ssl://ish-kafka-cluster-kafka-2.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/2: Topic ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] in state stopped at offset 2062245 (1/100000 msgs, 0/65536 kb queued, opv 4) is not fetchable: not in active fetch state
2022-03-03T19:31:50.811989546Z Debug | [thrd:main]: sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/1: Group "eta-calculator-kafka-consumer-rebalancer" coordinator is ish-kafka-cluster-kafka-0.ish-kafka-cluster-kafka-brokers.kafka.svc:9092 id 0
2022-03-03T19:31:50.833308164Z Debug | [thrd:main]: SyncGroup response: Success (77 bytes of MemberState data)
2022-03-03T19:31:50.833325256Z Debug | [thrd:main]: List with 2 partition(s):
2022-03-03T19:31:50.833330927Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] offset INVALID
2022-03-03T19:31:50.833335645Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] offset INVALID
2022-03-03T19:31:50.833339663Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state wait-sync -> wait-assign-call (state up)
2022-03-03T19:31:50.833344572Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": new assignment of 2 partition(s) in join-state wait-assign-call
2022-03-03T19:31:50.833348780Z Debug | [thrd:main]: No current assignment to clear
2022-03-03T19:31:50.833379317Z Debug | [thrd:main]: Added 2 partition(s) to assignment which now consists of 2 partition(s) where of 2 are in pending state and 0 are being queried
2022-03-03T19:31:50.833385388Z Debug | [thrd:main]: Resuming fetchers for 2 assigned partition(s): assign called
2022-03-03T19:31:50.833390157Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state wait-assign-call -> steady (state up)
2022-03-03T19:31:50.833395356Z Debug | [thrd:main]: Assignment dump (started_cnt=0, wait_stop_cnt=0)
2022-03-03T19:31:50.833400276Z Debug | [thrd:main]: List with 2 partition(s):
2022-03-03T19:31:50.833412508Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] offset STORED
2022-03-03T19:31:50.833489723Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] offset STORED
2022-03-03T19:31:50.833498279Z Debug | [thrd:main]: List with 2 partition(s):
2022-03-03T19:31:50.833503117Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] offset STORED
2022-03-03T19:31:50.833507396Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] offset STORED
2022-03-03T19:31:50.833512154Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.833516052Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:50.833520159Z Debug | [thrd:main]: Current assignment of 2 partition(s) with 0 pending adds, 0 offset queries, 0 partitions awaiting stop and 1 offset commits in progress
2022-03-03T19:31:50.833536339Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": setting group assignment to 2 partition(s)
2022-03-03T19:31:50.833540938Z Debug | [thrd:main]: List with 2 partition(s):
2022-03-03T19:31:50.833545196Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] offset STORED
2022-03-03T19:31:50.833569681Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] offset STORED
2022-03-03T19:31:50.907369603Z Debug | [thrd:main]: sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/1: Group "eta-calculator-kafka-consumer-rebalancer": querying for coordinator: OffsetCommitRequest failed
2022-03-03T19:31:50.907849007Z Debug | [thrd:main]: GroupCoordinator/0: OffsetCommit for 1 partition(s) in join-state steady: manual: returned: Local: Operation in progress
2022-03-03T19:31:50.910139537Z Debug | [thrd:main]: sasl_ssl://ish-kafka-cluster-kafka-1.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/1: Group "eta-calculator-kafka-consumer-rebalancer" coordinator is ish-kafka-cluster-kafka-0.ish-kafka-cluster-kafka-brokers.kafka.svc:9092 id 0
2022-03-03T19:31:51.011413483Z Debug | [thrd:main]: sasl_ssl://ish-kafka-cluster-kafka-2.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/2: Group "eta-calculator-kafka-consumer-rebalancer": querying for coordinator: OffsetCommitRequest failed
2022-03-03T19:31:51.012160131Z Debug | [thrd:main]: GroupCoordinator/0: OffsetCommit for 1 partition(s) in join-state steady: manual: returned: Broker: Specified group generation id is not valid
2022-03-03T19:31:51.012591185Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" initiating rebalance (EAGER) in state up (join-state steady) with 2 assigned partition(s) (lost): OffsetCommit error: Illegal generation
2022-03-03T19:31:51.012987436Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": current assignment of 2 partition(s) lost: OffsetCommit error: Illegal generation: revoking assignment and rejoining
2022-03-03T19:31:51.013003306Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state steady -> wait-unassign-call (state up)
2022-03-03T19:31:51.013009507Z Debug | [thrd:main]: Clearing current assignment of 2 partition(s)
2022-03-03T19:31:51.013152171Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state wait-unassign-call -> wait-unassign-to-complete (state up)
2022-03-03T19:31:51.013590918Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": current assignment no longer considered lost: unassign() called
2022-03-03T19:31:51.013606918Z Debug | [thrd:main]: Assignment dump (started_cnt=0, wait_stop_cnt=0)
2022-03-03T19:31:51.013620043Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:51.013633237Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:51.013695513Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:51.013759282Z Debug | [thrd:main]: List with 2 partition(s):
2022-03-03T19:31:51.013766766Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] offset STORED
2022-03-03T19:31:51.013779580Z Debug | [thrd:main]:  ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] offset STORED
2022-03-03T19:31:51.013937735Z Debug | [thrd:main]: Removing ish-event.position-cat062-cartesian-changed.fdc-id.v0 [0] from assignment (started=false, pending=false, queried=false, stored offset=INVALID)
2022-03-03T19:31:51.013948845Z Debug | [thrd:main]: Removing ish-event.position-cat062-cartesian-changed.fdc-id.v0 [1] from assignment (started=false, pending=false, queried=false, stored offset=INVALID)
2022-03-03T19:31:51.013954296Z Debug | [thrd:main]: Served 2 removed partition(s), with 0 offset(s) to commit
2022-03-03T19:31:51.013959455Z Debug | [thrd:main]: Current assignment of 0 partition(s) with 0 pending adds, 0 offset queries, 0 partitions awaiting stop and 1 offset commits in progress
2022-03-03T19:31:51.013972429Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": clearing group assignment
2022-03-03T19:31:51.014194536Z Debug | [thrd:main]: Assignment dump (started_cnt=0, wait_stop_cnt=0)
2022-03-03T19:31:51.014812558Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:51.015150438Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:51.015650109Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:51.015835255Z Debug | [thrd:main]: List with 0 partition(s):
2022-03-03T19:31:51.015862585Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": assignment operations done in join-state wait-unassign-to-complete (rebalance rejoin=false)
2022-03-03T19:31:51.015869659Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": unassign done in state up (join-state wait-unassign-to-complete)
2022-03-03T19:31:51.017145735Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": Rejoining group without an assignment: Unassignment done
2022-03-03T19:31:51.017161374Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state wait-unassign-to-complete -> init (state up)
2022-03-03T19:31:51.017182664Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer": join with 1 subscribed topic(s)
2022-03-03T19:31:51.017188945Z Debug | [thrd:main]: consumer join: metadata for subscription is up to date (212ms old)
2022-03-03T19:31:51.017194145Z Debug | [thrd:main]: sasl_ssl://ish-kafka-cluster-kafka-0.ish-kafka-cluster-kafka-brokers.kafka.svc:9092/0: Joining group "eta-calculator-kafka-consumer-rebalancer" with 1 subscribed topic(s)
2022-03-03T19:31:51.017199445Z Debug | [thrd:main]: Group "eta-calculator-kafka-consumer-rebalancer" changed join state init -> wait-join (state up)
2022-03-03T19:31:51.030272222Z Unhandled exception. Confluent.Kafka.KafkaException: Broker: Specified group generation id is not valid
2022-03-03T19:31:51.030289424Z    at Confluent.Kafka.Impl.SafeKafkaHandle.Commit(IEnumerable`1 offsets)
2022-03-03T19:31:51.030295616Z    at Confluent.Kafka.Consumer`2.Commit(ConsumeResult`2 result)
2022-03-03T19:31:51.030300996Z    at KafkaConsumerRebalancer.Program.Main(String[] args) in /src/Program.cs:line 51
2022-03-03T19:31:52.0481417Z   Stream closed EOF for ish-components/kafka-consumer-rebalancer-78d55fb657-jnx89 (kafka-consumer-rebalancer)

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
mhowlett commented, Jul 21, 2022

Ok, I’ve looked at this a bit closer now.

The following code:

var consumeResult = consumer.Consume();
Thread.Sleep(100);
consumer.Commit(consumeResult);

can expect to have commits fail, because a rebalance can happen before Commit is called (as you are seeing). You can just catch the exception and ignore it or log it. You’ll still get at-least-once semantics, though there will definitely be double processing.
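As a rough illustration of the catch-and-log approach described above, the sketch below wraps the Commit call in a try/catch. The broker address, group id, topic name and string value type are placeholders chosen for the example, not values from this issue.

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

class Program
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092", // assumption: local test broker
            GroupId = "example-group",           // assumption: placeholder group id
            EnableAutoCommit = false             // manual commits, as in the issue
        };

        var cts = new CancellationTokenSource();
        Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };

        using var consumer = new ConsumerBuilder<Ignore, string>(config).Build();
        consumer.Subscribe("example-topic");     // assumption: placeholder topic

        try
        {
            while (true)
            {
                var consumeResult = consumer.Consume(cts.Token);
                Thread.Sleep(100); // stand-in for message processing

                try
                {
                    consumer.Commit(consumeResult);
                }
                catch (KafkaException e)
                {
                    // A rebalance between Consume and Commit can make the commit fail.
                    // Log and carry on; the message may be processed again later.
                    Console.WriteLine($"Commit failed ({e.Error.Code}): {e.Error.Reason}");
                }
            }
        }
        catch (OperationCanceledException)
        {
            // Ctrl+C requested shutdown.
        }
        finally
        {
            consumer.Close(); // leave the group cleanly
        }
    }
}
```

With this in place a failed commit no longer terminates the process, so the crash loop described in the issue should not be triggered by this exception alone.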

To prevent the double processing, you could add a partitions revoked handler, which will get called as a side effect of the Consume call. Adding this handler will effectively hold up the rebalance until Consume is called in order to execute the handler. You won’t actually need to commit offsets in the revoked handler in your example, since you know you have committed final offsets because you’re committing after every message. Note that even with this change, you will still only have at-least-once semantics, since various other failure scenarios could result in double processing. But you will have prevented this common case.
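A minimal sketch of registering a partitions revoked handler, assuming the same manual-commit loop as in the snippet quoted earlier in this comment. The handler here only logs; the broker address, group id, topic name and value type are placeholders.

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

class Program
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092", // assumption: local test broker
            GroupId = "example-group",           // assumption: placeholder group id
            EnableAutoCommit = false
        };

        var cts = new CancellationTokenSource();
        Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };

        using var consumer = new ConsumerBuilder<Ignore, string>(config)
            // Invoked from inside Consume; per the comment above, registering it
            // holds up the rebalance until the next Consume call.
            .SetPartitionsRevokedHandler((c, partitions) =>
            {
                Console.WriteLine($"Partitions revoked: [{string.Join(", ", partitions)}]");
            })
            .Build();

        consumer.Subscribe("example-topic");     // assumption: placeholder topic

        try
        {
            while (true)
            {
                var consumeResult = consumer.Consume(cts.Token);
                Thread.Sleep(100);              // stand-in for message processing
                consumer.Commit(consumeResult); // runs before the next Consume call
            }
        }
        catch (OperationCanceledException)
        {
            // Ctrl+C requested shutdown.
        }
        finally
        {
            consumer.Close();
        }
    }
}
```

Other failures can still make Commit throw, so in practice it may be worth keeping a try/catch around the commit as well.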

Probably the best way to get at-least-once processing semantics is to leave auto commit enabled, set EnableAutoOffsetStore to false and use StoreOffset (instead of Commit). This will cause offsets to be committed periodically in the background, and also makes sure that the latest offsets that have been stored are committed before partitions are revoked in a rebalance. However, you’ll still need to set a partitions revoked handler to hold up the rebalance until the next Consume call if you want to prevent the double processing noted above.
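The configuration described above might look roughly like the following sketch. It assumes the config properties EnableAutoCommit and EnableAutoOffsetStore on ConsumerConfig; the broker address, group id, topic name and value type are placeholders.

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

class Program
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "localhost:9092", // assumption: local test broker
            GroupId = "example-group",           // assumption: placeholder group id
            EnableAutoCommit = true,             // commit stored offsets in the background
            EnableAutoOffsetStore = false        // only store offsets explicitly
        };

        var cts = new CancellationTokenSource();
        Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };

        using var consumer = new ConsumerBuilder<Ignore, string>(config)
            // Holds up the rebalance until the next Consume call (see above).
            .SetPartitionsRevokedHandler((c, partitions) =>
            {
                Console.WriteLine($"Partitions revoked: [{string.Join(", ", partitions)}]");
            })
            .Build();

        consumer.Subscribe("example-topic");     // assumption: placeholder topic

        try
        {
            while (true)
            {
                var consumeResult = consumer.Consume(cts.Token);
                Thread.Sleep(100);                  // stand-in for message processing
                consumer.StoreOffset(consumeResult); // mark as processed; committed later
            }
        }
        catch (OperationCanceledException)
        {
            // Ctrl+C requested shutdown.
        }
        finally
        {
            consumer.Close(); // commits stored offsets and leaves the group
        }
    }
}
```

Because the offset is stored only after processing finishes, a crash mid-message leads to redelivery rather than loss, which is the at-least-once behaviour the thread is after.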

@dancarlstedt - hopefully that clears up your followup question.

0 reactions
dancarlstedt commented, Jul 11, 2022

I’ve run into a similar issue in a few of our consumers when we’re consuming a large volume of messages from a multi-partition topic. We’re also hosting on k8s, so the consumer pods get into a nasty crash loop, fighting each other, which causes slower processing and more crash loops.

To repro this locally I’ve set up a single broker instance with an 8-partition topic and published a few million messages (it turns out a few hundred thousand probably would have been enough). With a single consumer process I’m able to process the messages as expected, but as I start additional processes, the rebalance crashes my original process.

To help diagnose what was occurring I decided to add in some additional partition handlers to log each time a partition was assigned, lost, or revoked. To my surprise this actually fixed my issue.

  • SetPartitionsAssignedHandler
  • SetPartitionsLostHandler
  • SetPartitionsRevokedHandler

My Consumer builder code to add in the additional logging to the partition handlers:

var consumer = new ConsumerBuilder<Ignore, TMessage>(config)
    .SetAvroValueDeserializer(schemaRegistry)
    .SetErrorHandler((c, error) => { _logger.LogWarning("Consumer error has occurred. {@Error}", error); })
    .SetPartitionsAssignedHandler((c, partitions) => _logger.LogWarning("Partitions were assigned: [{@Partitions}]", string.Join(", ", partitions)))
    .SetPartitionsLostHandler((c, partitions) => _logger.LogWarning("Partitions were lost: [{@Partitions}]", string.Join(", ", partitions)))
    .SetPartitionsRevokedHandler((c, partitions) => _logger.LogWarning("Partitions were revoked: [{@Partitions}]", string.Join(", ", partitions)))
    .Build();

Could it be that the side effect of setting any of these three delegates to a non-null Func<> is enough to work around this issue?

It appears that this guard clause prevents Librdkafka.conf_set_rebalance_cb from being called when none of these handlers is set, and having it called seems to be the fix for this issue. You can see in that guard clause that you don’t need all 3 of those handlers defined; you just need one to allow set_rebalance_cb to be invoked.

I’ve repro’d this same issue with package versions 1.8.2 and 1.9.0. For now I’m planning to deploy with the additional delegates defined so that I can get around this issue.


