Bug report: a sync group request timeout leaves the consumer in an abnormal state

In my environment I hit a bug: if a sync group request times out, the consumer enters an abnormal state. The cause is that the coordinator's rejoin_needed flag is False, when it actually needs to be True for the consumer to rebalance again.

Here is the relevant code: https://github.com/dpkp/kafka-python/blob/d2f9413b0311e6ec4d782cf9983f61c9f258cc7b/kafka/coordinator/base.py#L588-L604

The errback callback is self._failed_request:

https://github.com/dpkp/kafka-python/blob/d2f9413b0311e6ec4d782cf9983f61c9f258cc7b/kafka/coordinator/base.py#L474-L482

This function only marks the coordinator dead; it does not change the value of rejoin_needed, so the next poll never enters the rejoin path and the coordinator's state stays <unjoined>.
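
The sequence described above can be modelled with a small toy class. This is only a sketch, not kafka-python's code: ToyCoordinator, its method names, and the "node-1" placeholder are hypothetical, and it assumes (as the maintainer's comment further down indicates) that the flag had already been cleared when the JoinGroup response arrived.

```python
class ToyCoordinator:
    """Hypothetical, heavily simplified stand-in for the group coordinator."""

    def __init__(self):
        self.rejoin_needed = True        # a fresh consumer must join the group
        self.coordinator_id = "node-1"   # placeholder for a known coordinator

    def on_join_group_response(self):
        # Assumed behaviour: the flag is cleared as soon as the JoinGroup
        # response arrives, before the SyncGroup step has finished.
        self.rejoin_needed = False

    def on_sync_group_timeout(self):
        # Mirrors what the report says _failed_request does: the coordinator
        # is marked dead, but rejoin_needed is left untouched.
        self.coordinator_id = None

    def need_rejoin(self):
        return self.rejoin_needed


c = ToyCoordinator()
c.on_join_group_response()
c.on_sync_group_timeout()
print(c.need_rejoin())  # False: the next poll sees no reason to rejoin
```

In this model the consumer ends up without a coordinator and without a pending rejoin, which matches the stuck state reported here.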

So I call the request_rejoin function in _failed_request, and with that change the bug no longer occurs for me:

    def _failed_request(self, node_id, request, future, error):
        log.error('Error sending %s to node %s [%s]',
                  request.__class__.__name__, node_id, error)

        # If the sync group request timed out, we need to try to rejoin the group
        version = 0 if self.config['api_version'] < (0, 11, 0) else 1
        if isinstance(request, SyncGroupRequest[version]):
            self.request_rejoin()

        # Marking coordinator dead
        # unless the error is caused by internal client pipelining
        if not isinstance(error, (Errors.NodeNotReadyError,
                                  Errors.TooManyInFlightRequests)):
            self.coordinator_dead(error)
        future.failure(error)

I pushed a patch for this problem to my repository: https://github.com/licy121/kafka-python/commit/bc9ebcfa3ee48f9476f770d0082c07e69c149d5c Please help me review it, thanks!

@jeffwidman @dpkp @mumrah @tvoinarovskyi

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

dpkp commented, Jan 13, 2019

I think the underlying cause is that rejoin_needed is currently set after the JoinResponse is received, but it should instead happen in the final success callback:

diff --git a/kafka/coordinator/base.py b/kafka/coordinator/base.py
index 8ce9a24..4177115 100644
--- a/kafka/coordinator/base.py
+++ b/kafka/coordinator/base.py
@@ -334,6 +334,7 @@ class BaseCoordinator(object):
             self.join_future = None
             self.state = MemberState.STABLE
             self.rejoining = False
+            self.rejoin_needed = False
             self._heartbeat_thread.enable()
         self._on_join_complete(self._generation.generation_id,
                                self._generation.member_id,
@@ -497,7 +498,6 @@ class BaseCoordinator(object):
                     self._generation = Generation(response.generation_id,
                                                   response.member_id,
                                                   response.group_protocol)
-                    self.rejoin_needed = False

                 if response.leader_id == response.member_id:
                     log.info("Elected group leader -- performing partition"
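
The effect of moving the assignment can be illustrated with a toy model. This is a hedged sketch rather than the library's implementation: JoinFlow, run_join, and the clear_on_join_response switch are hypothetical names introduced only to compare the two placements of `rejoin_needed = False`.

```python
class JoinFlow:
    """Hypothetical model of one join attempt: JoinGroup, then SyncGroup."""

    def __init__(self, clear_on_join_response):
        self.rejoin_needed = True
        self.clear_on_join_response = clear_on_join_response

    def run_join(self, sync_succeeds):
        # Step 1: the JoinGroup response is received.
        if self.clear_on_join_response:
            self.rejoin_needed = False  # old placement (line removed by the diff)
        # Step 2: SyncGroup either completes or fails, e.g. with a timeout.
        if sync_succeeds:
            self.rejoin_needed = False  # new placement (final success callback)
        return self.rejoin_needed


old = JoinFlow(clear_on_join_response=True).run_join(sync_succeeds=False)
new = JoinFlow(clear_on_join_response=False).run_join(sync_succeeds=False)
print(old, new)  # False True
```

With the old placement a failed sync leaves the flag cleared, so nothing triggers another join; with the new placement the flag stays set until the whole join has succeeded.
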

vimal3271 commented, Feb 27, 2019

Waiting for release with fix.
