
Consumer stops consuming after broker transport failure

See original GitHub issue

Hi,

We encounter a problem with consumers that stop delivering new messages to the ‘data’ listener. This seemingly happens after a broker becomes temporarily unavailable (broker transport failure), but only rarely. We observed this on several different consumers on different topics with similar configurations, seemingly at random (most of the time the consumers resume operation after a broken broker connection).

The consumer is still synchronized with its consumer group (which consists of a single consumer for one topic of 5 partitions), and the high offsets increase as new messages arrive on the partitions, but the consumer lag keeps increasing and messages are seemingly never consumed.

We observed this sequence of events, where all partitions of a topic stopped consuming:

  • This ‘event.error’ seems to indicate the beginning of the problem: Error: broker transport failure

  • After this, no stats are logged again, although they were being logged every second before that.

  • 10 seconds after the error, the consumer stops fetching from every partition of the topic, with these two event logs appearing for each partition:

{ severity: 7, fac: 'FETCH' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Topic TOPIC_NAME [3] in state active at offset 39611 (10/10 msgs, 0/40960 kb queued, opv 6) is not fetchable: queued.min.messages exceeded

{ severity: 7, fac: 'FETCHADD' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Removed TOPIC_NAME [3] from fetch list (0 entries, opv 6)

  • This happens at a time when no new messages are available (these partitions receive infrequent messages that arrive at set times in this test environment), and the ‘data’ listener never receives any message, so it is unclear to us why the queue would be full.

Probably linked to #182.
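One way to catch this stall before lag piles up is a liveness check on the consumer's own statistics. The sketch below is a hypothetical detector, not part of node-rdkafka: it compares two successive librdkafka statistics snapshots (the per-partition field names `hi_offset`, `consumer_lag`, and `fetch_state` follow librdkafka's STATISTICS.md) and flags partitions where the broker's high watermark advanced while this consumer's lag grew.

```javascript
// Hypothetical stall detector (not a node-rdkafka API). Compares two
// successive librdkafka statistics snapshots; the partition field names
// (hi_offset, consumer_lag, fetch_state) follow librdkafka's STATISTICS.md.
function findStalledPartitions(prev, curr) {
  const stalled = [];
  for (const [topic, t] of Object.entries(curr.topics || {})) {
    for (const [partition, p] of Object.entries(t.partitions || {})) {
      const prevTopic = (prev.topics || {})[topic];
      const q = prevTopic && prevTopic.partitions[partition];
      if (!q) continue;
      // New messages arrived (high watermark advanced) but lag grew:
      // the broker has data that this consumer is not receiving.
      if (p.hi_offset > q.hi_offset &&
          p.consumer_lag > q.consumer_lag &&
          p.fetch_state === 'active') {
        stalled.push({ topic, partition: Number(partition) });
      }
    }
  }
  return stalled;
}
```

In node-rdkafka the snapshots arrive via the ‘event.stats’ event (enabled by ‘statistics.interval.ms’) as a JSON string to parse. In practice you would want several consecutive flagged intervals before acting, since a consumer that is merely slow for one interval matches the same pattern; and since in this issue the stats stop being emitted entirely, a separate "no stats received for N seconds" timer is also worth keeping.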

Environment Information

  • OS: Debian Stretch
  • Node Version: 8.11.0
  • node-rdkafka version: 2.4.2

Consumer configuration

const messageMaxBytes = 150 * 1024 * 1024; // 150 MB

const consumerConfig = {
  'api.version.request': true,
  'message.max.bytes': messageMaxBytes,
  'receive.message.max.bytes': Math.floor(messageMaxBytes * 1.3),
  // Logging
  'log.connection.close': true,
  'statistics.interval.ms': 1000,
  // Consumer-specific rdkafka settings
  'group.id': group_id,
  'auto.commit.interval.ms': 2000,
  'enable.auto.commit': true,
  'enable.auto.offset.store': true,
  'enable.partition.eof': false,
  'fetch.wait.max.ms': 100,
  'fetch.min.bytes': 1,
  'fetch.message.max.bytes': 20 * 1024 * 1024, // 20 MB
  'fetch.error.backoff.ms': 0,
  'heartbeat.interval.ms': 1000,
  'queued.min.messages': 10,
  'queued.max.messages.kbytes': Math.floor(40 * 1024), // 40 MB
  'session.timeout.ms': 7000,
};
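As an aside on the FETCH log above: "queued.min.messages exceeded" appears to be librdkafka's normal local-queue backpressure — once a partition's local queue holds at least ‘queued.min.messages’ messages (here only 10), fetching for that partition pauses until the application drains the queue; the bug is that fetching never resumes. A small sanity check over values like the ones above can be sketched as follows; this is not a node-rdkafka API, and both the 512-byte overhead margin and the heartbeat rule of thumb are general Kafka guidance rather than values from this issue.

```javascript
// Hypothetical config sanity check (not a node-rdkafka API). The rules
// encode common guidance: receive.message.max.bytes should exceed the
// largest message by a protocol-overhead margin (the 512 bytes here is
// an assumption), heartbeats should fire at least 3x per session
// timeout, and a tiny queued.min.messages makes fetching pause often.
function checkConsumerConfig(conf) {
  const problems = [];
  if (conf['receive.message.max.bytes'] < conf['message.max.bytes'] + 512) {
    problems.push('receive.message.max.bytes is too close to message.max.bytes');
  }
  if (conf['heartbeat.interval.ms'] * 3 > conf['session.timeout.ms']) {
    problems.push('heartbeat.interval.ms should be <= session.timeout.ms / 3');
  }
  if (conf['queued.min.messages'] < 100) {
    problems.push('queued.min.messages is very low; fetching will pause often');
  }
  return problems;
}
```

Run against the configuration above, only the last rule fires: 10 is far below librdkafka's usual default for ‘queued.min.messages’, so each partition's fetcher pauses after only a handful of queued messages — which at minimum makes the "not fetchable" state much easier to enter.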

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 24
  • Comments: 14

Top GitHub Comments

9 reactions
carlessistare commented, Apr 11, 2019

@webmakersteve just pinging here too, since this issue is tracked in multiple issues, and in my opinion it’s pretty critical: recovering from this problem in prod environments is not easy.

7 reactions
bobzsj87 commented, Mar 20, 2019

Same behaviour and same “broker transport failure” error. The consumer stops and we can see the resulting lag on the topic. We have to restart the whole thing.
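Since restarting currently seems to be the only reliable recovery, an automated restart with jittered exponential backoff at least limits the operational cost. The sketch below is generic, not a node-rdkafka API, and the base/cap values are arbitrary assumptions:

```javascript
// Hypothetical full-jitter exponential backoff for consumer restarts.
// baseMs and capMs are assumptions, not node-rdkafka or Kafka defaults.
function backoffMs(attempt, baseMs = 1000, capMs = 60000) {
  // Double the ceiling each attempt, capped, then pick a random delay
  // below it so many consumers do not all reconnect at the same moment.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// Sketch of wiring (hypothetical): on a transport-failure error event,
// tear down and recreate the consumer after backoffMs(attempt), e.g.
// consumer.on('event.error', (err) => {
//   if (String(err.message).includes('transport failure')) scheduleRestart();
// });
```

Full jitter is used here so that a fleet of consumers hit by the same broker outage does not reconnect in lockstep.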

Read more comments on GitHub >

Top Results From Across the Web

edenhill/librdkafka - Gitter
Our process contains 2 consumer instances and a single producer pointing to the same broker. When we analyzed a memory dump we found...
What does "Broker transport failure" mean in kafka?
"transport failure" seems mean the consumer is having network issue with the broker, is that right? what should I do when this error...
confluent_kafka API — confluent-kafka 1.9.0 documentation
Stops consuming. Commits offsets, unless the consumer property 'enable.auto.commit' is set to False. Leaves the consumer group.
Troubleshooting your Amazon MSK cluster
Restart the group coordinator of the stuck consumer group using the RebootBroker API action. Error delivering broker logs to Amazon CloudWatch Logs. When...
Configuring Vertica for Apache Kafka Version 0.9 and Earlier
Kafka brokers running version 0.9.0 or earlier cannot respond to the API query. If the consumer does not receive a reply from the...
