
Consumer stops consuming after broker transport failure

See original GitHub issue

Hi,

We encounter a problem with consumers that stop delivering new messages to the ‘data’ listener. This seemingly happens after a broker becomes temporarily unavailable (broker transport failure), but only rarely. We observed this on several different consumers on different topics with similar configurations, seemingly at random (most of the time the consumers resume operation after a broken broker connection).

The consumer is still synchronized with its consumer group (which consists of a single consumer for one topic of 5 partitions), and the high offsets increase as new messages arrive on the partitions, but the consumer lag keeps increasing and messages are seemingly never consumed.

We observed this sequence of events, where all partitions of a topic stopped consuming:

  • This ‘event.error’ seems to indicate the beginning of the problem: Error: broker transport failure

  • After this, no stats are logged again, although they were being logged every second before that.

  • 10 seconds after the error, the consumer stops fetching from every partition of the topic, with these two event logs appearing for each partition:

{ severity: 7, fac: 'FETCH' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Topic TOPIC_NAME [3] in state active at offset 39611 (10/10 msgs, 0/40960 kb queued, opv 6) is not fetchable: queued.min.messages exceeded

{ severity: 7, fac: 'FETCHADD' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Removed TOPIC_NAME [3] from fetch list (0 entries, opv 6)

  • This happens at a time when no new messages are available (these partitions receive infrequent messages that arrive at set times in this test environment), and the ‘data’ listener never receives any message, so it is unclear to us why the queue would be full.

Probably linked to #182.
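One way to catch this stall before lag piles up is a liveness check on the consumer's own statistics. The sketch below is a hypothetical detector, not part of node-rdkafka: it compares two successive librdkafka statistics snapshots (the per-partition field names `hi_offset`, `consumer_lag`, and `fetch_state` follow librdkafka's STATISTICS.md) and flags partitions where the broker's high watermark advanced while this consumer's lag grew.

```javascript
// Hypothetical stall detector (not a node-rdkafka API). Compares two
// successive librdkafka statistics snapshots; the partition field names
// (hi_offset, consumer_lag, fetch_state) follow librdkafka's STATISTICS.md.
function findStalledPartitions(prev, curr) {
  const stalled = [];
  for (const [topic, t] of Object.entries(curr.topics || {})) {
    for (const [partition, p] of Object.entries(t.partitions || {})) {
      const prevTopic = (prev.topics || {})[topic];
      const q = prevTopic && prevTopic.partitions[partition];
      if (!q) continue;
      // New messages arrived (high watermark advanced) but lag grew:
      // the broker has data that this consumer is not receiving.
      if (p.hi_offset > q.hi_offset &&
          p.consumer_lag > q.consumer_lag &&
          p.fetch_state === 'active') {
        stalled.push({ topic, partition: Number(partition) });
      }
    }
  }
  return stalled;
}
```

In node-rdkafka the snapshots arrive via the ‘event.stats’ event (enabled by ‘statistics.interval.ms’) as a JSON string to parse. In practice you would want several consecutive flagged intervals before acting, since a consumer that is merely slow for one interval matches the same pattern; and since in this issue the stats stop being emitted entirely, a separate "no stats received for N seconds" timer is also worth keeping.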

Environment Information

  • OS: Debian Stretch
  • Node Version: 8.11.0
  • node-rdkafka version: 2.4.2

Consumer configuration

const messageMaxBytes = 150 * 1024 * 1024; // 150 MB

const consumerConfig = {
  'api.version.request': true,
  'message.max.bytes': messageMaxBytes,
  'receive.message.max.bytes': Math.floor(messageMaxBytes * 1.3),
  // Logging
  'log.connection.close': true,
  'statistics.interval.ms': 1000,
  // Consumer-specific rdkafka settings
  'group.id': group_id,
  'auto.commit.interval.ms': 2000,
  'enable.auto.commit': true,
  'enable.auto.offset.store': true,
  'enable.partition.eof': false,
  'fetch.wait.max.ms': 100,
  'fetch.min.bytes': 1,
  'fetch.message.max.bytes': 20 * 1024 * 1024, // 20 MB
  'fetch.error.backoff.ms': 0,
  'heartbeat.interval.ms': 1000,
  'queued.min.messages': 10,
  'queued.max.messages.kbytes': Math.floor(40 * 1024), // 40 MB
  'session.timeout.ms': 7000,
};
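As an aside on the FETCH log above: "queued.min.messages exceeded" appears to be librdkafka's normal local-queue backpressure — once a partition's local queue holds at least ‘queued.min.messages’ messages (here only 10), fetching for that partition pauses until the application drains the queue; the bug is that fetching never resumes. A small sanity check over values like the ones above can be sketched as follows; this is not a node-rdkafka API, and both the 512-byte overhead margin and the heartbeat rule of thumb are general Kafka guidance rather than values from this issue.

```javascript
// Hypothetical config sanity check (not a node-rdkafka API). The rules
// encode common guidance: receive.message.max.bytes should exceed the
// largest message by a protocol-overhead margin (the 512 bytes here is
// an assumption), heartbeats should fire at least 3x per session
// timeout, and a tiny queued.min.messages makes fetching pause often.
function checkConsumerConfig(conf) {
  const problems = [];
  if (conf['receive.message.max.bytes'] < conf['message.max.bytes'] + 512) {
    problems.push('receive.message.max.bytes is too close to message.max.bytes');
  }
  if (conf['heartbeat.interval.ms'] * 3 > conf['session.timeout.ms']) {
    problems.push('heartbeat.interval.ms should be <= session.timeout.ms / 3');
  }
  if (conf['queued.min.messages'] < 100) {
    problems.push('queued.min.messages is very low; fetching will pause often');
  }
  return problems;
}
```

Run against the configuration above, only the last rule fires: 10 is far below librdkafka's usual default for ‘queued.min.messages’, so each partition's fetcher pauses after only a handful of queued messages — which at minimum makes the "not fetchable" state much easier to enter.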

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 24
  • Comments: 14

Top GitHub Comments

9 reactions
carlessistare commented, Apr 11, 2019

@webmakersteve just pinging here too, since this issue is tracked in multiple issues, and in my opinion it’s pretty critical: recovering from this problem in prod environments is not easy.

7 reactions
bobzsj87 commented, Mar 20, 2019

Same behaviour and same “broker transport failure” error. The consumer stops and we can see the resulting lag on the topic. We have to restart the whole thing.
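Since restarting currently seems to be the only reliable recovery, an automated restart with jittered exponential backoff at least limits the operational cost. The sketch below is generic, not a node-rdkafka API, and the base/cap values are arbitrary assumptions:

```javascript
// Hypothetical full-jitter exponential backoff for consumer restarts.
// baseMs and capMs are assumptions, not node-rdkafka or Kafka defaults.
function backoffMs(attempt, baseMs = 1000, capMs = 60000) {
  // Double the ceiling each attempt, capped, then pick a random delay
  // below it so many consumers do not all reconnect at the same moment.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// Sketch of wiring (hypothetical): on a transport-failure error event,
// tear down and recreate the consumer after backoffMs(attempt), e.g.
// consumer.on('event.error', (err) => {
//   if (String(err.message).includes('transport failure')) scheduleRestart();
// });
```

Full jitter is used here so that a fleet of consumers hit by the same broker outage does not reconnect in lockstep.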

Read more comments on GitHub >

Top Results From Across the Web

edenhill/librdkafka - Gitter
Our process contains 2 consumer instances and a single producer pointing to the same broker. When we analyzed a memory dump we found...
What does "Broker transport failure" mean in kafka?
"transport failure" seems mean the consumer is having network issue with the broker, is that right? what should I do when this error...
confluent_kafka API — confluent-kafka 1.9.0 documentation
Stops consuming. Commits offsets, unless the consumer property 'enable.auto.commit' is set to False. Leaves the consumer group.
Troubleshooting your Amazon MSK cluster
Restart the group coordinator of the stuck consumer group using the RebootBroker API action. Error delivering broker logs to Amazon CloudWatch Logs. When...
Configuring Vertica for Apache Kafka Version 0.9 and Earlier
Kafka brokers running version 0.9.0 or earlier cannot respond to the API query. If the consumer does not receive a reply from the...
