Bug report: a SyncGroup request timeout leaves the consumer in an abnormal state
In my environment I hit a bug: if a SyncGroup request times out, the consumer enters an abnormal state. The reason I found is that the coordinator's rejoin_needed flag is False at that point, whereas it needs to be True for the consumer to rebalance again.
Let's look at the relevant code: https://github.com/dpkp/kafka-python/blob/d2f9413b0311e6ec4d782cf9983f61c9f258cc7b/kafka/coordinator/base.py#L588-L604
The errback callback there is self._failed_request.
Here is that function: https://github.com/dpkp/kafka-python/blob/d2f9413b0311e6ec4d782cf9983f61c9f258cc7b/kafka/coordinator/base.py#L474-L482
It only marks the coordinator dead and does not change the value of rejoin_needed, so the next poll never enters the rejoin path and the coordinator's state stays <unjoined>.
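To make the failure mode concrete, here is a minimal toy model of the sequence described above. It is not kafka-python code: the class BuggyToyCoordinator and its methods are hypothetical, and it assumes the flag is cleared at the join step before the sync step finishes. It only illustrates why a consumer whose flag stays False never rejoins.

```python
class BuggyToyCoordinator:
    """Hypothetical stand-in for the group coordinator; not kafka-python code."""

    def __init__(self):
        self.rejoin_needed = True
        self.state = "unjoined"
        self.coordinator_known = True

    def on_join_response(self):
        # Assumed ordering of the buggy flow: the flag is cleared as soon
        # as the join step succeeds, before the sync step has finished.
        self.rejoin_needed = False
        self.state = "rebalancing"

    def on_sync_failure(self):
        # Mirrors the reported behaviour of _failed_request: the coordinator
        # is marked dead, but rejoin_needed is left untouched.
        self.coordinator_known = False
        self.state = "unjoined"

    def poll(self):
        # A later poll finds the coordinator again, but it only starts
        # a rejoin when the flag asks for one.
        self.coordinator_known = True
        if self.rejoin_needed:
            self.state = "rebalancing"
        return self.state


coordinator = BuggyToyCoordinator()
coordinator.on_join_response()
coordinator.on_sync_failure()  # e.g. the sync request timed out
states = [coordinator.poll() for _ in range(3)]
print(states, coordinator.rejoin_needed)
```

In this toy model every poll after the timeout reports "unjoined", because nothing ever sets the flag back to True.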
So I added a call to the request_rejoin function in _failed_request, and with that change the bug no longer occurs for me:
    def _failed_request(self, node_id, request, future, error):
        log.error('Error sending %s to node %s [%s]',
                  request.__class__.__name__, node_id, error)
        # If the sync group request failed (e.g. timed out),
        # we need to try to rejoin the group
        version = 0 if self.config['api_version'] < (0, 11, 0) else 1
        if isinstance(request, SyncGroupRequest[version]):
            self.request_rejoin()
        # Marking coordinator dead
        # unless the error is caused by internal client pipelining
        if not isinstance(error, (Errors.NodeNotReadyError,
                                  Errors.TooManyInFlightRequests)):
            self.coordinator_dead(error)
        future.failure(error)
I have pushed a patch that fixes the problem to my fork: https://github.com/licy121/kafka-python/commit/bc9ebcfa3ee48f9476f770d0082c07e69c149d5c Please help me review it, thanks!
I think the underlying cause is that rejoin_needed is currently reset to False as soon as the JoinResponse is received, when that should instead happen in the final success callback.
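As a rough illustration of that suggestion, the toy model below clears the flag only once the whole join-and-sync sequence has succeeded. It is a sketch under assumptions, not the actual kafka-python change: FixedToyCoordinator and all of its method names are hypothetical.

```python
class FixedToyCoordinator:
    """Hypothetical sketch; the flag is cleared only on final success."""

    def __init__(self):
        self.rejoin_needed = True
        self.state = "unjoined"

    def on_join_response(self):
        # The flag is deliberately NOT cleared here: the member is not
        # usable until the sync step has also succeeded.
        self.state = "rebalancing"

    def on_sync_failure(self):
        # A failed or timed-out sync leaves rejoin_needed as it was (True).
        self.state = "unjoined"

    def on_sync_success(self):
        # Final success callback: only now is the rejoin request satisfied.
        self.state = "stable"
        self.rejoin_needed = False

    def poll(self):
        # Retry the whole join-and-sync sequence while a rejoin is pending.
        if self.rejoin_needed:
            self.on_join_response()
            self.on_sync_success()
        return self.state


coordinator = FixedToyCoordinator()
coordinator.on_join_response()
coordinator.on_sync_failure()  # e.g. the sync request timed out
flag_after_timeout = coordinator.rejoin_needed
state_after_poll = coordinator.poll()
print(flag_after_timeout, state_after_poll, coordinator.rejoin_needed)
```

With this ordering the failure handler needs no special case for sync requests, because the flag is still True after the timeout and the next poll retries the join.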
Waiting for release with fix.