Bug report: a sync group request timeout leaves the consumer in an abnormal state

In my environment, when a sync group request times out, the consumer enters an abnormal state. The cause is that the coordinator's rejoin_needed flag is False at that point, but it needs to be True for the consumer to rebalance again.

Here is the relevant code: https://github.com/dpkp/kafka-python/blob/d2f9413b0311e6ec4d782cf9983f61c9f258cc7b/kafka/coordinator/base.py#L588-L604

The errback callback registered there is self._failed_request: https://github.com/dpkp/kafka-python/blob/d2f9413b0311e6ec4d782cf9983f61c9f258cc7b/kafka/coordinator/base.py#L474-L482

This function only marks the coordinator dead; it does not change the value of rejoin_needed. As a result, the next poll never enters the rejoin path, and the coordinator's state stays <unjoined>.
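
The stuck state described above can be reproduced with a small toy model. This is only a sketch: ToyCoordinator and its method names are hypothetical stand-ins, not kafka-python code, and it assumes (as the maintainer's comment in this thread indicates) that the flag is cleared as soon as the join response arrives.

```python
class ToyCoordinator:
    """Hypothetical stand-in for the group coordinator; not kafka-python code."""

    def __init__(self):
        self.rejoin_needed = True       # a fresh member has to join first
        self.coordinator_id = "node-0"  # illustrative node id
        self.state = "unjoined"

    def on_join_response(self):
        # Assumed behavior: the flag is cleared when the join response
        # arrives, before the sync group request has completed.
        self.rejoin_needed = False

    def on_sync_failure(self, error):
        # Mirrors what the issue says about _failed_request: the coordinator
        # is marked dead, but rejoin_needed is left untouched.
        self.coordinator_id = None

    def need_rejoin(self):
        return self.rejoin_needed


def simulate_sync_timeout():
    """Join succeeds, then the sync group request times out."""
    coordinator = ToyCoordinator()
    coordinator.on_join_response()
    coordinator.on_sync_failure(TimeoutError("sync group request timed out"))
    return coordinator


stuck = simulate_sync_timeout()
# The member never reached a stable state, yet nothing asks it to rejoin.
print(stuck.need_rejoin(), stuck.state)
```

In this model the member is still unjoined after the timeout, but need_rejoin() reports that no rejoin is required, which matches the symptom reported here.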

So I called request_rejoin in _failed_request, and that fixed the bug for me:

    def _failed_request(self, node_id, request, future, error):
        log.error('Error sending %s to node %s [%s]',
                  request.__class__.__name__, node_id, error)

        # If a sync group request fails (for example, on timeout), rejoin the group
        version = 0 if self.config['api_version'] < (0, 11, 0) else 1
        if isinstance(request, SyncGroupRequest[version]):
            self.request_rejoin()

        # Marking coordinator dead
        # unless the error is caused by internal client pipelining
        if not isinstance(error, (Errors.NodeNotReadyError,
                                  Errors.TooManyInFlightRequests)):
            self.coordinator_dead(error)
        future.failure(error)

I pushed a patch for this to my fork: https://github.com/licy121/kafka-python/commit/bc9ebcfa3ee48f9476f770d0082c07e69c149d5c Please help review it, thanks!

@jeffwidman @dpkp @mumrah @tvoinarovskyi

Issue Analytics

Created 5 years ago
Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction

dpkp commented, Jan 13, 2019

I think the underlying cause is that rejoin_needed is currently set after the JoinResponse is received, but it should instead happen in the final success callback:

diff --git a/kafka/coordinator/base.py b/kafka/coordinator/base.py
index 8ce9a24..4177115 100644
--- a/kafka/coordinator/base.py
+++ b/kafka/coordinator/base.py
@@ -334,6 +334,7 @@ class BaseCoordinator(object):
             self.join_future = None
             self.state = MemberState.STABLE
             self.rejoining = False
+            self.rejoin_needed = False
             self._heartbeat_thread.enable()
         self._on_join_complete(self._generation.generation_id,
                                self._generation.member_id,
@@ -497,7 +498,6 @@ class BaseCoordinator(object):
                     self._generation = Generation(response.generation_id,
                                                   response.member_id,
                                                   response.group_protocol)
-                    self.rejoin_needed = False

                 if response.leader_id == response.member_id:
                     log.info("Elected group leader -- performing partition"
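
The idea behind this diff, clearing the flag only in the final success callback, can be sketched with a toy model. ToyCoordinatorFixed, poll_once, and the helper names are hypothetical and only illustrate the ordering; they are not kafka-python's API.

```python
class ToyCoordinatorFixed:
    """Hypothetical model of the proposed ordering; not kafka-python code."""

    def __init__(self):
        self.rejoin_needed = True
        self.state = "unjoined"

    def _on_join_response(self):
        # The flag is deliberately NOT cleared here anymore.
        return "assignment-pending"

    def _on_join_complete(self):
        # Final success callback: the only place the flag is cleared.
        self.state = "stable"
        self.rejoin_needed = False

    def poll_once(self, sync_succeeds):
        """Run one join attempt if needed and return the resulting state."""
        if self.rejoin_needed:
            self._on_join_response()
            if sync_succeeds:
                self._on_join_complete()
            # On sync failure nothing resets the flag, so the next poll retries.
        return self.state


fixed = ToyCoordinatorFixed()
first = fixed.poll_once(sync_succeeds=False)   # sync group request times out
flag_after_failure = fixed.rejoin_needed       # still set, so a retry will happen
second = fixed.poll_once(sync_succeeds=True)   # retry completes the join
print(first, flag_after_failure, second, fixed.rejoin_needed)
```

With this ordering a failed sync leaves rejoin_needed set, so the failure path does not need to know which request type failed.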

0 reactions

vimal3271 commented, Feb 27, 2019

Waiting for release with fix.