Kafka: Keeps expiring consumers

Bug Report

Current behavior

We have 10 microservices that all interact with each other via Kafka. We have noticed that a service randomly fails to subscribe to a topic, or randomly stops consuming and logs a Kafka error saying the heartbeat was not received, while the service itself otherwise works fine.

[Nest] 19 - 06/16/2021, 1:09:12 PM [ClientKafka] ERROR [Connection] Response Heartbeat(key: 12, version: 3) {"timestamp":"2021-06-16T13:09:12.779Z","logger":"kafkajs","broker":"kafka-0.kafka-headless.dev.svc.cluster.local:9092","clientId":"reviews-ts-service-client","error":"The group is rebalancing, so a rejoin is needed","correlationId":1241,"size":10} +2857ms
[Nest] 19 - 06/16/2021, 1:09:12 PM [ClientKafka] ERROR [Runner] The group is rebalancing, re-joining {"timestamp":"2021-06-16T13:09:12.779Z","logger":"kafkajs","groupId":"reviews-consumer-ts-customer-client","memberId":"reviews-ts-service-client-453b2860-fdab-4c01-aa98-e015667b8d3b","error":"The group is rebalancing, so a rejoin is needed","retryCount":0,"retryTime":330} +1m
[Nest] 21 - 06/16/2021, 6:49:52 PM [ClientKafka] ERROR [Connection] Response Heartbeat(key: 12, version: 3) {"timestamp":"2021-06-16T18:49:52.458Z","logger":"kafkajs","broker":"kafka-0.kafka-headless.dev.svc.cluster.local:9092","clientId":"captain-ps-service-client","error":"The coordinator is not aware of this member","correlationId":54,"size":10} +327904ms
[Nest] 21 - 06/16/2021, 6:49:52 PM [ClientKafka] ERROR [Runner] The coordinator is not aware of this member, re-joining the group {"timestamp":"2021-06-16T18:49:52.460Z","logger":"kafkajs","groupId":"captain-consumer-ps-client","memberId":"captain-ps-service-client-77090749-5dd9-4d17-a12b-aa072579caec","error":"The coordinator is not aware of this member","retryCount":7,"retryTime":30000} +1m

Input Code

import { KafkaOptions, Transport } from "@nestjs/microservices";
import appConfig from "config/appConfig";

export const microServiceConfig: KafkaOptions = {
  transport: Transport.KAFKA,

  options: {
    client: {
      clientId: 'promocode-service',
      brokers: [...`${appConfig().KafkaHost}`.split(",")],
    },
    consumer: {
      groupId: 'promocode-consumer',
      sessionTimeout: 300000,
      retry: { retries: 30 },
    },
    subscribe: {
      fromBeginning: false,
    }
  }
};

Expected behavior

It is not clear why Kafka keeps timing out randomly. If I redeploy, everything works, and then it stops again. Is the wrapper causing the issue? These random failures make me wonder what causes them.

This is running on k8s and this behavior is seen in only 1-2 users; Kafka has enough memory.

All consumers have different group IDs, and all have a high session timeout as well.

Issue Analytics

  • State:closed
  • Created 2 years ago
  • Reactions:5
  • Comments:5 (1 by maintainers)

Top GitHub Comments

3 reactions
bigtable2006 commented, Oct 21, 2021

Hi @jayeshanandani ,

By default, the heartbeat interval is 3 seconds (heartbeatInterval = 3s), and the library's heartbeat method is called every 5 seconds (maxWaitTimeInMs = 5s).

What does that mean? Every 5 seconds, the library calls the heartbeat method, which decides whether to send a heartbeat request to the Kafka broker based on the condition below.

Called after every "maxWaitTimeInMs":

// kafkajs/src/consumer/consumerGroup.js
async heartbeat() {
  // ...
  if (memberId && now >= this.lastRequest + heartbeatInterval) {
    // Send a heartbeat request to the Kafka broker to keep the membership alive.
    await this.coordinator.heartbeat(payload)
    this.lastRequest = Date.now()
    // ...
  }
}

In my case, the handler does heavy processing (parsing and formatting JSON), and it takes more than 26s to finish a message. It looks like during that time my service cannot send the heartbeat to the Kafka broker, so my consumer is expired and removed from the group.
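As an alternative to raising the timeouts, long work can sometimes be split into steps with a heartbeat attempted between them. The sketch below assumes the handler has access to a heartbeat callback (kafkajs passes a `heartbeat` function to `eachBatch` handlers); `processWithHeartbeats` and its parameters are hypothetical names used only for illustration, not part of kafkajs or NestJS.

```typescript
// Hypothetical helper: run long work item by item and give the consumer
// a chance to heartbeat after each item.
type Heartbeat = () => Promise<void>;

async function processWithHeartbeats<T>(
  items: T[],
  handleItem: (item: T) => Promise<void> | void,
  heartbeat: Heartbeat,
): Promise<number> {
  let heartbeatCalls = 0;
  for (const item of items) {
    await handleItem(item);
    // The library still gates the actual request on heartbeatInterval,
    // so calling this after every item is cheap.
    await heartbeat();
    heartbeatCalls += 1;
  }
  return heartbeatCalls;
}
```

This only helps when a single item is fast enough on its own; one step that blocks for longer than the session timeout would still expire the consumer.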

HOW TO RESOLVE THIS ISSUE?

    sessionTimeout: 60000,
    heartbeatInterval: 40000,
    maxWaitTimeInMs: 43000,

sessionTimeout: it should be greater than the processing time of the method.
heartbeatInterval: someone said it should be 2/3 of sessionTimeout.
maxWaitTimeInMs: it must be greater than heartbeatInterval.
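The three rules of thumb above can be written down as a small check. This is only a sketch: `ConsumerTimings` and `checkTimings` are hypothetical names, the 26000 ms processing time is taken from this comment, and it is assumed that the three values sit under `options.consumer` next to `groupId`, as in the Input Code.

```typescript
// Field names mirror the consumer options used in this thread; the interface
// is a local stand-in so the sketch runs without kafkajs installed.
interface ConsumerTimings {
  sessionTimeout: number;
  heartbeatInterval: number;
  maxWaitTimeInMs: number;
}

// Hypothetical checker for the three rules of thumb from this comment.
function checkTimings(t: ConsumerTimings, maxProcessingMs: number): string[] {
  const problems: string[] = [];
  if (t.sessionTimeout <= maxProcessingMs) {
    problems.push("sessionTimeout should exceed the longest processing time");
  }
  if (t.heartbeatInterval >= t.sessionTimeout) {
    problems.push("heartbeatInterval must be lower than sessionTimeout");
  }
  if (t.maxWaitTimeInMs <= t.heartbeatInterval) {
    problems.push("maxWaitTimeInMs should exceed heartbeatInterval (per this comment)");
  }
  return problems;
}

// Assumed placement: the `consumer` section of the NestJS Kafka options.
const consumer = {
  groupId: "promocode-consumer",
  sessionTimeout: 60000,
  heartbeatInterval: 40000,
  maxWaitTimeInMs: 43000,
};

console.log(checkTimings(consumer, 26000)); // → []
```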

The configuration above resolved the issue.


Note: the first time, I configured:

    sessionTimeout: 60000,
    heartbeatInterval: 40000,
    maxWaitTimeInMs: 30000,

It always showed this error:

INFO [GroupCoordinator 1]: Preparing to rebalance group local-commission-normalizer-client in state PreparingRebalance with old generation 16 (__consumer_offsets-33) (reason: removing member local-normalizer-client-6f5cecee-d77b-45a6-9f7b-1f0bff49f5ef on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)

=> The heartbeat method is called too early: it is called after 30s, but the condition for sending a request to the Kafka broker requires 40s to have passed, which is why the error happens.
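The difference between the two configurations in this comment can be checked with a simplified model of the gating condition. This is an illustration under stated assumptions, not kafkajs code: it assumes the heartbeat method runs exactly every `maxWaitTimeInMs` and that the last request happened at t = 0; `maxHeartbeatGapMs` is a hypothetical name.

```typescript
// Simplified model: the heartbeat method runs every `callEveryMs`, and a
// request is only sent when at least `heartbeatIntervalMs` has passed since
// the last request. Returns the longest gap between two sent requests.
function maxHeartbeatGapMs(
  callEveryMs: number,
  heartbeatIntervalMs: number,
  horizonMs: number,
): number {
  let lastRequest = 0; // assumption: the consumer joined the group at t = 0
  let maxGap = 0;
  for (let now = callEveryMs; now <= horizonMs; now += callEveryMs) {
    if (now >= lastRequest + heartbeatIntervalMs) {
      maxGap = Math.max(maxGap, now - lastRequest);
      lastRequest = now;
    }
  }
  return maxGap;
}

const sessionTimeoutMs = 60000;
const firstTryGap = maxHeartbeatGapMs(30000, 40000, 240000); // maxWaitTimeInMs: 30000
const resolvedGap = maxHeartbeatGapMs(43000, 40000, 240000); // maxWaitTimeInMs: 43000

console.log(firstTryGap, firstTryGap >= sessionTimeoutMs); // → 60000 true
console.log(resolvedGap, resolvedGap >= sessionTimeoutMs); // → 43000 false
```

In this model the 30s call is skipped because 40s have not yet elapsed, so the next request only goes out at 60s, which is already the full session timeout.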

1 reaction
jayeshanandani commented, Jun 21, 2021

@kamilmysliwiec do we need more information here? Any input would be a great help.
