Consumer stops consuming after broker transport failure
Hi,
We are encountering a problem where consumers stop delivering new messages to the ‘data’ listener. This seemingly happens after a broker becomes temporarily unavailable (broker transport failure), but only rarely: we have observed it on several different consumers on different topics with similar configurations, apparently at random (most of the time the consumers resume operation after a broken broker connection).
The consumer stays synchronized with its consumer group (which consists of a single consumer for one topic with 5 partitions), and the high watermark offsets increase as new messages arrive on the partitions, but the consumer lag keeps increasing and the messages are never actually delivered to the consumer.
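For context, the consumer runs in flowing mode. A minimal node-rdkafka setup along these lines reproduces the wiring described above (a sketch, not our exact code; the broker address, group id and topic name are placeholders):

```js
const Kafka = require('node-rdkafka');

const consumer = new Kafka.KafkaConsumer({
  'metadata.broker.list': 'BROKER_IP:9092', // placeholder
  'group.id': 'my-group',                   // placeholder
  // ...plus the configuration listed below
}, {});

consumer.on('ready', () => {
  consumer.subscribe(['TOPIC_NAME']); // placeholder
  consumer.consume(); // flowing mode: messages are pushed to the 'data' event
});

consumer.on('data', (message) => {
  // Every consumed message normally arrives here; after the failure
  // described below, this callback is never invoked again.
  console.log(message.topic, message.partition, message.offset);
});

consumer.on('event.error', (err) => {
  // This is where 'broker transport failure' surfaces.
  console.error(err);
});

consumer.connect();
```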
We observed the following sequence of events when all partitions of one topic stopped consuming:
- This ‘event.error’ seems to indicate the beginning of the problem:

  ```
  Error: broker transport failure
  ```

- After this, no stats are logged again, although they had been logged every second up to that point.

- 10 seconds after the error, the consumer stops fetching every partition of the topic, with these two event logs appearing for each partition:

  ```
  { severity: 7, fac: 'FETCH' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Topic TOPIC_NAME [3] in state active at offset 39611 (10/10 msgs, 0/40960 kb queued, opv 6) is not fetchable: queued.min.messages exceeded
  { severity: 7, fac: 'FETCHADD' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Removed TOPIC_NAME [3] from fetch list (0 entries, opv 6)
  ```

- This happens at a time when no new messages are available (these partitions receive infrequent messages at set times in this test environment), and the ‘data’ listener never receives a message, so it is not clear to us why the queue would be full. One way to watch the per-partition queue state is the stats-handler sketch after this list.
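The queue counters shown in the FETCH log (10/10 msgs, 0/40960 kb) are also exposed in librdkafka’s statistics JSON, so an ‘event.stats’ handler along these lines can log the per-partition queue state every interval (a sketch; the field names come from librdkafka’s STATISTICS.md, and the payload shape may vary slightly between versions):

```js
// Requires 'statistics.interval.ms' to be set (1000 in the config below).
consumer.on('event.stats', (stats) => {
  // node-rdkafka delivers librdkafka's statistics as a JSON string.
  const s = JSON.parse(stats.message);
  const topic = (s.topics || {})['TOPIC_NAME']; // placeholder
  if (!topic) return;
  for (const [id, p] of Object.entries(topic.partitions || {})) {
    if (id === '-1') continue; // skip librdkafka's internal placeholder partition
    console.log(
      `partition ${id}:`,
      `fetch_state=${p.fetch_state}`,   // 'active' in the log above
      `fetchq_cnt=${p.fetchq_cnt}`,     // messages sitting in the local queue
      `consumer_lag=${p.consumer_lag}`  // keeps growing during the stall
    );
  }
});
```

Note, though, that per the sequence above even the stats events stop after the error, which suggests the problem is not limited to the fetch path.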
Probably linked to #182.
Environment Information
- OS: Debian Stretch
- Node Version: 8.11.0
- node-rdkafka version: 2.4.2
Consumer configuration
```js
'api.version.request': true,
'message.max.bytes': 150 * 1024 * 1024, // 150 MB
'receive.message.max.bytes': messageMaxBytes * 1.3,
// Logging
'log.connection.close': true,
'statistics.interval.ms': 1000,
// Consumer-specific rdkafka settings
'group.id': group_id,
'auto.commit.interval.ms': 2000,
'enable.auto.commit': true,
'enable.auto.offset.store': true,
'enable.partition.eof': false,
'fetch.wait.max.ms': 100,
'fetch.min.bytes': 1,
'fetch.message.max.bytes': 20 * 1024 * 1024, // 20 MB
'fetch.error.backoff.ms': 0,
'heartbeat.interval.ms': 1000,
'queued.min.messages': 10,
'queued.max.messages.kbytes': Math.floor(40 * 1024), // 40 MB
'session.timeout.ms': 7000,
```
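Given ‘queued.min.messages’: 10, the “not fetchable: queued.min.messages exceeded” log above is, on its own, normal backpressure: librdkafka stops fetching a partition once its local queue already holds that many messages, and puts it back on the fetch list when the application drains the queue. A rough paraphrase of that gating (a simplified sketch, not librdkafka’s actual C logic):

```js
// Simplified paraphrase of librdkafka's per-partition fetch gating
// (assumption: the real decision considers more state, e.g. offsets
// and the partition's operation version shown as 'opv' in the logs).
function isFetchable(queuedMsgs, queuedKbytes, conf) {
  return queuedMsgs < conf['queued.min.messages'] &&
         queuedKbytes < conf['queued.max.messages.kbytes'];
}

const conf = { 'queued.min.messages': 10, 'queued.max.messages.kbytes': 40 * 1024 };

// Matches the log line "(10/10 msgs, 0/40960 kb queued) ... is not fetchable":
console.log(isFetchable(10, 0, conf)); // false -> removed from fetch list
```

The anomaly, then, is not that fetching pauses but that the 10 queued messages are apparently never delivered to the ‘data’ listener, so the queue never drains and fetching never resumes.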
Top GitHub Comments
@webmakersteve just pinging here too, since this issue is tracked in multiple issues and, in my opinion, it’s pretty critical: recovering from this problem in prod environments is not easy.
Same behaviour and same ‘broker transport failure’ error. The consumer stops and we can see the topic lag caused by that. We have to restart the whole thing.
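Until the root cause is fixed, the restart that these comments describe can at least be automated. A blunt watchdog along these lines is one possible stopgap (a sketch with arbitrary thresholds; createConsumer is a hypothetical factory that rebuilds the consumer with the configuration above and re-attaches its listeners):

```js
// Hypothetical stopgap: if no 'data' event arrives for too long,
// tear the consumer down and rebuild it from scratch.
const STALL_MS = 60 * 1000; // arbitrary: tune to the topic's traffic pattern

let consumer = createConsumer(); // hypothetical factory, see lead-in
let lastData = Date.now();
consumer.on('data', () => { lastData = Date.now(); });
consumer.connect();

setInterval(() => {
  if (Date.now() - lastData > STALL_MS) {
    console.warn('No messages for %dms, recreating consumer', STALL_MS);
    consumer.disconnect(() => {
      consumer = createConsumer();
      consumer.on('data', () => { lastData = Date.now(); });
      lastData = Date.now();
      consumer.connect();
    });
  }
}, 10 * 1000);
```

Since the partitions in this report only receive messages at set times, a pure inactivity timer would fire spuriously; gating the restart on a growing consumer_lag from the ‘event.stats’ handler above would be more robust.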