
long processing consumer restart will stall

See original GitHub issue

We have a consumer that takes a long time to process each batch. Whenever a new consumer tries to join the group while the long-processing consumer is busy, the new consumer stalls. If we kill the long-processing consumer and restart it, both consumers stall. When we kill the long-processing consumer, it tries to issue a LeaveGroup request, but that fails, seemingly due to the client request timeout. When we start it again, it appears to send a topic metadata request to the broker and then issue the JoinGroup request, returning a future, but when I check the server log I don’t see the corresponding request in kafka-request.log. We construct the consumer with the following code:

        self.consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers,
                                      value_deserializer=deserializer,
                                      group_id=self.user_defined_sub_name,
                                      heartbeat_interval_ms=10000,
                                      session_timeout_ms=300000,
                                      enable_auto_commit=False)

On the server side, we run Kafka 0.10.0.0 with default settings. It looks like a RebalanceInProgressError is thrown:
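For context, one common workaround for long-processing consumers in kafka-python of this era (where heartbeats were only sent from inside `poll()`, with no background heartbeat thread) is to pause fetching and keep calling `poll()` while the slow work runs. This is a hedged sketch, not code from the issue; `process_with_keepalive`, `work_items`, and `handle` are hypothetical names:

```python
def process_with_keepalive(consumer, work_items, handle):
    """Pause fetching, heartbeat via poll() during slow work, then resume.

    `consumer` is assumed to be a kafka-python KafkaConsumer with an
    active assignment; `handle` is the slow, user-supplied processing
    function for a single item.
    """
    parts = list(consumer.assignment())
    consumer.pause(*parts)               # stop fetching new records
    try:
        for item in work_items:
            handle(item)                 # slow work happens here
            consumer.poll(timeout_ms=0)  # returns no records while paused,
                                         # but keeps group membership alive
    finally:
        consumer.resume(*parts)
```

The `pause`/`resume`/`assignment` calls are part of the kafka-python `KafkaConsumer` API; whether `poll()` alone is sufficient to heartbeat depends on the client version in use.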

2016-08-22 20:39:08,984 - kafka.coordinator - INFO - Discovered coordinator 100 for group v1.user.queue
2016-08-22 20:39:08,984 - kafka.coordinator.consumer - INFO - Revoking previously assigned partitions set() for group v1.user.queue
2016-08-22 20:39:08,990 - kafka.cluster - DEBUG - Updated cluster metadata to ClusterMetadata(brokers: 1, topics: 1, groups: 1)
2016-08-22 20:39:08,990 - kafka.coordinator - INFO - (Re-)joining group v1.user.queue
2016-08-22 20:39:08,990 - kafka.coordinator - DEBUG - Sending JoinGroup (JoinGroupRequest_v0(group='v1.user.queue', session_timeout=300000, member_id='', protocol_type='consumer', group_protocols=[(protocol_name='range', protocol_metadata=b'\x00\x00\x00\x00\x00\x01\x00\x1av1.messagingtest.user_info\x00\x00\x00\x00'), (protocol_name='roundrobin', protocol_metadata=b'\x00\x00\x00\x00\x00\x01\x00\x1av1.messagingtest.user_info\x00\x00\x00\x00')])) to coordinator 100
2016-08-22 20:39:08,991 - kafka.conn - DEBUG - <BrokerConnection host=10.128.64.81/10.128.64.81 port=9092> Request 5: JoinGroupRequest_v0(group='v1.user.queue', session_timeout=300000, member_id='', protocol_type='consumer', group_protocols=[(protocol_name='range', protocol_metadata=b'\x00\x00\x00\x00\x00\x01\x00\x1av1.messagingtest.user_info\x00\x00\x00\x00'), (protocol_name='roundrobin', protocol_metadata=b'\x00\x00\x00\x00\x00\x01\x00\x1av1.messagingtest.user_info\x00\x00\x00\x00')])
2016-08-22 20:43:04,576 - kafka.conn - WARNING - <BrokerConnection host=10.128.64.81/10.128.64.81 port=9092> timed out after 40000 ms. Closing connection.
2016-08-22 20:43:04,576 - kafka.client - WARNING - Node 100 connection failed -- refreshing metadata
2016-08-22 20:43:04,576 - kafka.coordinator - ERROR - Error sending JoinGroupRequest_v0 to node 100 [Error 7 RequestTimedOutError: Request timed out after 40000 ms]
2016-08-22 20:43:04,576 - kafka.coordinator - WARNING - Marking the coordinator dead (node 100) for group v1.user.queue: None.
2016-08-22 20:43:04,678 - kafka.coordinator - DEBUG - Sending group coordinator request for group v1.user.queue to broker 100
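Note that the 40000 ms in the timeout above matches kafka-python’s default `request_timeout_ms`. During a rebalance the broker can hold the JoinGroup response for up to `session_timeout_ms`, so with `session_timeout_ms=300000` the join can never complete before the 40-second request timeout fires and the coordinator is marked dead. A hedged sketch of an adjusted configuration (the broker address and group name are copied from the log purely for illustration):

```python
# Assumption: kafka-python client; values other than the timeouts are
# illustrative, taken from the log output above.
consumer_config = dict(
    bootstrap_servers="10.128.64.81:9092",  # broker from the log above
    group_id="v1.user.queue",
    heartbeat_interval_ms=10000,
    session_timeout_ms=300000,
    request_timeout_ms=305000,  # must exceed session_timeout_ms so the
                                # JoinGroup response can be awaited in full
    enable_auto_commit=False,
)
assert consumer_config["request_timeout_ms"] > consumer_config["session_timeout_ms"]
# consumer = KafkaConsumer(**consumer_config)  # requires a running broker
```

Newer kafka-python releases enforce this ordering themselves and refuse to construct a consumer whose request timeout is not larger than its session timeout.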

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 6
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

2 reactions
ilaif commented, Sep 21, 2016

@dalejin2014 Did you solve this? I’m getting this too, and it’s a real pain.

0 reactions
jeffwidman commented, Aug 28, 2017

@CesarLanderos are your offsets getting incremented? That would explain messages never being processed again, and it would likely not be related to this error.
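The diagnostic question above can be answered by snapshotting committed offsets before and after a processing cycle (e.g. via kafka-python’s `KafkaConsumer.committed()` per partition). A hypothetical helper, assuming the snapshots are plain `{partition: offset}` dicts; `stalled_partitions` is an illustrative name, not part of any library:

```python
def stalled_partitions(before, after):
    """Return partitions whose committed offset did not advance between
    two {partition: offset} snapshots. If every partition appears here,
    the consumer is committing nothing and is likely stuck or endlessly
    reprocessing the same records."""
    return sorted(p for p, off in after.items() if off <= before.get(p, -1))

# stalled_partitions({0: 42, 1: 10}, {0: 42, 1: 15})  ->  [0]
```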

Read more comments on GitHub >

Top Results From Across the Web

[KAFKA-4086] long processing consumer restart will stall
Whenever a new consumer tries to join the group while the long processing consumer is processing, the new consumer will stall. If we...
Read more >
MassTransit batch consumer stalling, not receiving messages
We've got an issue where from time to time, a consumers for a specific message will stall and stop receiving/processing messages.
Read more >
Kafka Consumer not consuming new messages randomly
There is a process running very often which produces new messages in the topic. ... What you're describing sounds like some sort of...
Read more >
How to Prevent Reactive Java Applications from Stalling
This initiative provides a standard for asynchronous stream processing with non-blocking backpressure for the JVM and JavaScript runtimes. It is ...
Read more >
20 best practices for Apache Kafka at scale | New Relic
There are three main reasons for this: First, consumers of the "hot" (higher throughput) partitions will have to process more messages than ...
Read more >
