KafkaConsumer~consume and KafkaConsumerConsumeNum::KafkaConsumerConsumeNum have different interpretations of the timeout/partition end

In our environment we have very few messages per second coming into our topics, and we want to process these as quickly as we can.

We noticed a considerable latency of at least 1s (never less than that), and up to a few seconds, when using node-rdkafka. Testing with the standard Kafka and rdkafka performance tools didn’t exhibit any issues and showed that both the cluster and the client can easily handle our message volume within the desired latency. Further instrumentation of our own producer code also showed that the latency of producing messages (the time between handing a message off and receiving its delivery report) was acceptable.

We then tested with the rdkafka_performance tool producing messages in a form acceptable to our system, and measured the latency there: we still observed the 1s latency. In one instance I changed the producing rate to 1 message/s and actually got to a state where our consumer would not process any message for multiple seconds, while Kafka reported the consumer group’s lag increasing continuously.

Enabling suitable rdkafka debug logs (aka “all” 😄) showed that while rdkafka reported it was fetching the new messages, it never seemed to hand them over to us.

Reading through the code I found that the problem is in KafkaConsumerConsumeNum::KafkaConsumerConsumeNum: when it reaches the end of the partition it simply retries rather than aborting, and for each retry it applies the timeout again. In the worst case this leads to a situation where it reads one message, hits the end of the partition, retries reading with a timeout of 1s, and just before the timeout expires receives another message. As long as the number of messages doesn’t exceed the requested maximum it will keep going, ending in a potentially “infinite” loop if the producer just sends out a slow trickle of messages.
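
To illustrate, here is a simplified model of that loop (a JavaScript sketch, not the actual C++ code in the binding; fetchOne is a hypothetical stand-in for a single librdkafka fetch with the given timeout):

    // Simplified model of the consume-many loop, not the real implementation.
    // fetchOne(timeoutMs) is a hypothetical stand-in for a single librdkafka fetch:
    // it resolves to a message, 'PARTITION_EOF', or 'TIMED_OUT'.
    async function consumeNum(fetchOne, size, timeoutMs) {
      const messages = [];
      while (messages.length < size) {
        const result = await fetchOne(timeoutMs); // timeout is re-applied on every fetch
        if (result === 'TIMED_OUT') break;        // only a timeout ends the loop early
        if (result === 'PARTITION_EOF') continue; // EOF retries instead of returning
        messages.push(result);
      }
      return messages; // with a slow producer this can take up to ~size * timeoutMs
    }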

In our case we call consume with a size of 1000 and a timeout of 1000, the idea being to get “a lot of messages” when there are plenty, and to avoid busy loops when there are none.
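
For reference, the call pattern looks roughly like this (a sketch; handleError and processBatch are placeholders for our own code, and 1000ms is also the library’s default consume timeout, settable via setDefaultConsumeTimeout):

    // Sketch of our call pattern; handleError/processBatch are placeholders.
    consumer.setDefaultConsumeTimeout(1000); // ms; matches the library default
    consumer.consume(1000, (err, messages) => {
      if (err) return handleError(err);
      processBatch(messages); // we commit manually after each batch
    });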

The documentation for KafkaConsumer~consume(sz, cb) states:

This will keep going until it gets ERR__PARTITION_EOF or ERR__TIMED_OUT so the array may not be the same size you ask for. The size is advisory, but we will not exceed it.

This sounded reasonable to me: I assumed it would get us whatever is available in big batches (sz=1000), but would stop quickly once the partition reached its end.

I’ve now changed the timeout to 10ms, which gets us nice latencies, but I think something needs to be done in KafkaConsumerConsumeNum:

  1. Fix the handling of the partition end to actually abort (matching the description of the caller), and/or
  2. Treat the timeout as a timeout for the whole operation rather than for each individual fetch (a user-land approximation is sketched after this list), and/or
  3. Somehow fix the documentation so it is clear that calling consume(sz) with a timeout of t could take up to t*sz ms to complete. 😃
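
As a rough illustration of option 2, an operation-wide timeout can be approximated in user code today by repeatedly calling consume with a short per-call timeout until an overall deadline is reached (a sketch under those assumptions, not something the library provides):

    // User-land approximation of an operation-wide timeout (sketch only).
    function consumeWithDeadline(consumer, size, deadlineMs, done) {
      const deadline = Date.now() + deadlineMs;
      const batch = [];
      consumer.setDefaultConsumeTimeout(10); // short per-fetch timeout
      (function step() {
        consumer.consume(size - batch.length, (err, messages) => {
          if (err) return done(err, batch);
          batch.push(...messages);
          if (batch.length >= size || Date.now() >= deadline) return done(null, batch);
          step();
        });
      })();
    }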

Environment Information

  • node-rdkafka version: 2.3.4

node-rdkafka Configuration Settings

  • consume timeout is set to 1000 (ms, the default as per https://github.com/Blizzard/node-rdkafka/blob/5e971743047c373aaf046db15528ff6b818eb281/lib/kafka-consumer.js#L123)
  • consumer global options:
    {
      "debug":"broker,cgrp,topic",
      "enable.auto.commit":false,
      "log.connection.close":false,
      "queue.buffering.max.ms":10,
      "queued.max.messages.kbytes":10240,
      "session.timeout.ms":6000,
      "socket.blocking.max.ms":100,
      "socket.keepalive.enable":true,
      "metadata.broker.list":["kafka:9092"],
      "statistics.interval.ms":30000,
      "group.id":"@collaborne/polaris-service/polaris-service-489987766-9zl60/9e772105-d823-40b6-ae90-4ae1c88131a9",
      "offset_commit_cb":true
    }
    
    (auto-commit is disabled because we commit manually after each batch; the broker points to a DNS name that round-robins over a 3-Kafka/3-ZooKeeper cluster. A sketch of how these options are passed to the consumer follows this list.)
  • consumer topic options:
    {
      "auto.offset.reset":"latest"
    }
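
For completeness, a sketch of how these two option objects are passed to the consumer (globalOptions and topicOptions stand for the two objects listed above; the topic name is a placeholder):

    const Kafka = require('node-rdkafka');

    // globalOptions / topicOptions are the two objects listed above.
    const consumer = new Kafka.KafkaConsumer(globalOptions, topicOptions);

    consumer.connect();
    consumer.on('ready', () => {
      consumer.subscribe(['our-topic']); // placeholder topic name
      // ...then consumer.consume(1000, cb) as shown earlier
    });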
    

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
webmakersteve commented, Aug 9, 2018

Pretty sure it used to stop at EOF and then it was requested that I change it because the behavior was confusing 😛

I would prefer to get the library back to a true 1:1 mapping in the low-level API, and that’s what I’m going to do to get around this. Basically this means errors will be transparently surfaced to the user in the low-level API, and it is up to you to handle them (through retries, etc.) if necessary.

I think that will solve this problem.
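
A rough sketch of what that caller-side handling could look like, assuming the error constants exposed under Kafka.CODES.ERRORS (an illustration, not the final shape of the change):

    const Kafka = require('node-rdkafka');
    const { ERR__PARTITION_EOF, ERR__TIMED_OUT } = Kafka.CODES.ERRORS;

    // Sketch: once errors are surfaced directly, the caller decides what to do.
    consumer.consume(1000, (err, messages) => {
      if (err) {
        if (err.code === ERR__PARTITION_EOF || err.code === ERR__TIMED_OUT) {
          // reached the end of the partition, or nothing arrived in time:
          // not fatal, just schedule the next poll
          return;
        }
        throw err; // anything else is a real error
      }
      processBatch(messages); // placeholder for application code
    });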

0 reactions
stale[bot] commented, Dec 10, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
