
Consumer offset "stuck" on log compacted messages

See original GitHub issue

Hello,

I’m looking for some help with an issue where my consumer is “stuck” at a certain offset. (Or a bunch of offsets: one offset for each partition in the topic.)

To break it down:

  • I am trying to consume a log-compacted topic with 12 partitions (hundreds of millions of messages).
  • I deployed multiple instances of our consumer service.
  • The consumers started out great, but after a while, they all came to a halt. The offsets have not changed for a few days. For each partition, they are still lagging many millions of messages behind.
  • I made a local copy of KafkaJS and added some extra debug logs to do some digging. Here’s what I found:
    • For a given partition, the offset was n.
    • The messages at offset n have been log compacted away. The next available message is at, say, n + 13.
    • KafkaJS would fetch the messages at offset n.
    • It would receive a few messages, but all of them have offset < n. For example, it might receive one message at offset n - 1, or two messages at offset n - 2 and n - 1.
    • In batch.js these messages are ignored (it only uses messagesWithinOffset), so the result is treated as an empty batch. A simplified sketch of this behaviour follows this list.
    • The “empty” batch is skipped, and KafkaJS repeats the above steps.
    • My eachBatch callback is never triggered.
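
Here is that sketch — purely illustrative, with made-up offsets and a simplified message shape, not the actual KafkaJS source:

// Illustrative only: the broker returns the record batch that originally contained offset n,
// but n itself has been compacted away, so every record we get back sits below the offset we asked for.
const requestedOffset = 100 // "n"
const fetchedMessages = [{ offset: 98 }, { offset: 99 }] // offsets n - 2 and n - 1

// batch.js only keeps messages at or above the requested offset ("messagesWithinOffset")
const messagesWithinOffset = fetchedMessages.filter(m => m.offset >= requestedOffset)

console.log(messagesWithinOffset.length) // 0 -> the batch is treated as empty
// The "empty" batch is skipped without advancing the offset, so the consumer
// fetches offset 100 again on the next iteration and never makes progress.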

I am not sure what is going wrong here: do I need to change some configuration? (Currently using defaults across the board.) Has the consumer group entered a bad state, somehow? Or could this be a bug?
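
In case it helps, the consumer setup is more or less the following (broker, topic, and group names here are placeholders; everything else is left at its defaults):

const { Kafka } = require('kafkajs')

const kafka = new Kafka({ clientId: 'my-service', brokers: ['broker-1:9092'] })
const consumer = kafka.consumer({ groupId: 'my-consumer-group' }) // default consumer options

const run = async () => {
  await consumer.connect()
  await consumer.subscribe({ topic: 'my-compacted-topic' })
  await consumer.run({
    eachBatch: async ({ batch, resolveOffset, heartbeat }) => {
      // For the stuck partitions, this callback is never invoked.
      for (const message of batch.messages) {
        // ...process the message...
        resolveOffset(message.offset)
        await heartbeat()
      }
    },
  })
}

run().catch(console.error)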

Advice would be greatly appreciated. Thanks in advance! 🙂

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 19 (2 by maintainers)

Top GitHub Comments

4 reactions
jlek commented, Dec 3, 2019

Hi @tulios

Thanks for looking into this. I am indeed running 1.11.0. I tried 1.12.0-beta.0 after reading your suggestion, but alas, it didn’t fix the issue. Although I do think it’s very closely related to our issue!

I will try to break it down as I understand it. Please let me know if I’ve misunderstood something. 🙂

The issue you mentioned (PR #511)

Underlying Cause

KafkaJS could get stuck on an offset marker for a Control Record, because it would ignore the Control Record, handle an empty batch by doing nothing, fetch the same Control Record Batch again, and repeat.

Fix

The issue was fixed by incrementing the offset by 1 when a control batch was fetched.
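
In other words (a self-contained, simplified sketch of the idea as I understand it, with a made-up batch shape — not the actual code from that PR):

// A control batch carries no consumable records, so the consumer still has to
// advance past its offset; otherwise it re-fetches the same control record forever.
function nextOffsetAfterBatch(requestedOffset, batch) {
  if (batch.isControlBatch && batch.messages.length === 0) {
    return requestedOffset + 1 // skip over the control record marker
  }
  return requestedOffset // normal batches advance via the resolved message offsets
}

// Example: a transaction marker sitting at the requested offset
console.log(nextOffsetAfterBatch(42, { isControlBatch: true, messages: [] })) // 43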

In the Java Consumer

Useful comment in the Java consumer: https://github.com/apache/kafka/blob/9aa660786e46c1efbf5605a6a69136a1dac6edb9/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L1499-L1505

This issue

Underlying Cause

KafkaJS can get stuck on an offset marker for a compacted Record, because it will fetch an “empty” batch, handle the “empty” batch by doing nothing, and repeat. (The fetch did return some records, but they are all from an earlier offset than requested, so these records are ignored and the batch is treated like an empty batch.)

Hacky Fix

This can be fixed by incrementing the offset by 1 if all the fetched messages are from before the requested offset. (It might also be worth checking that fetchedOffset is less than highWatermark, i.e. that we haven’t reached the end of the partition yet.) I confirmed this by copying KafkaJS locally and hacking consumerGroup.js and batch.js a bit.
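
Roughly, the check looks like this (a heavily simplified, self-contained sketch of the idea, not the actual diff I applied; the function and parameter names are just for illustration):

// If every fetched record sits below the offset we asked for, and we have not yet
// reached the high watermark, bump the offset by one so the fetch loop can make
// progress past the compacted-away record.
function resolveStuckOffset({ requestedOffset, fetchedOffsets, highWatermark }) {
  const allBelowRequested =
    fetchedOffsets.length > 0 && fetchedOffsets.every(offset => offset < requestedOffset)
  const notAtEndOfPartition = requestedOffset < highWatermark

  if (allBelowRequested && notAtEndOfPartition) {
    return requestedOffset + 1 // skip the compacted-away offset
  }
  return requestedOffset
}

// The scenario above: fetching offset n (= 100) returns only n - 2 and n - 1
console.log(resolveStuckOffset({ requestedOffset: 100, fetchedOffsets: [98, 99], highWatermark: 1000 })) // 101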

“Proper” Fix

To match the Java Consumer, we could follow the approach described in the Fetcher comment linked below.

In the Java Consumer

Useful comment in the Java consumer: https://github.com/apache/kafka/blob/9aa660786e46c1efbf5605a6a69136a1dac6edb9/clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java#L1456-L1460

0 reactions
jlek commented, Dec 3, 2019

Oops, forgot this new method I added to Batch:

  /**
   * With compressed messages, it's possible for the returned messages to have offsets smaller than the starting offset.
   * These messages will be filtered out (i.e. they are not even included in this.unfilteredMessages).
   * If these are the only messages, the batch will appear as an empty batch.
   *
   * isEmpty() and isEmptyIncludingFiltered() will always return true if the batch is empty,
   * but this method will only return true if the batch is empty due to log compacted messages.
   *
   * @returns {boolean} True if the batch is empty because of log compacted messages in the partition.
   */
  isEmptyDueToLogCompactedMessages() {
    return this.partitionDataMessages.length > 0 && // There was at least one message
           this.isEmptyIncludingFiltered(); // All messages had an offset lower than the requested offset
  }

Top Results From Across the Web

What is Kafka log compaction, and how does it work?
While doing the log compaction, Kafka identifies the new versions of messages by comparing the message key and the offset. If we send...

Kafka consuming offset from compacted topic issue
The problem is when a topic is compacted consumer doesn't start from offset 0. I tried changing consumer_group_ID and resetting offset. apache- ...

Kafka Log Cleaner Issues - anishek agarwal - Medium
Kafka Log Cleaner Issues · Issue 1: Message size invalid · Issue 2: Corrupt Message Size · Issue 3: Clean offset is other...

Kafka Log Compaction: A Comprehensive Guide - Hevo Data
Within a data partition, all messages are stored in a sorted manner, based on each message's offset. This is how Apache Kafka stores...

Skipping offsets warning in Druid and Kafka transactional ...
The 'skipped offset' check in Druid's Kafka indexing consumer is a sanity check meant to make sure we process every message in order....
