How to achieve 'exactly once' semantics by using an SQL database
In the docs under manual commits (https://kafka.js.org/docs/1.12.0/consuming#manual-commits) it says:
Note that you don't have to store consumed offsets in Kafka, but instead store it in a storage mechanism of your own choosing. That's an especially useful approach when the results of consuming a message are written to a datastore that allows atomically writing the consumed offset with it, like for example a SQL database. When possible it can make the consumption fully atomic and give "exactly once" semantics that are stronger than the default "at-least once" semantics you get with Kafka's offset commit functionality.
Could you please point me to a working example of how this would be achieved?
Many thanks
-Paul
Issue Analytics: Created 2 years ago · Comments: 5
Don’t have a public working example, but something like this should work:
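(The snippet originally posted here did not survive. Judging by the follow-up answer, it checked an `isProcessed` flag in the database before handling a message. A minimal sketch of that idea, with an in-memory object standing in for the SQL table and all names invented for illustration:)

```javascript
// Sketch only: an in-memory stand-in for a SQL table of processed offsets.
// In a real setup this would be a table with a UNIQUE constraint on
// (topic, partition, offset); all names here are hypothetical.
const processed = new Set();

function key(topic, partition, offset) {
  return `${topic}:${partition}:${offset}`;
}

// Returns true if the message had not been seen before and was handled.
function handleMessage({ topic, partition, offset, value }, sideEffect) {
  const isProcessed = processed.has(key(topic, partition, offset));
  if (isProcessed) return false;                // already handled: skip
  sideEffect(value);                            // do the actual work
  processed.add(key(topic, partition, offset)); // mark as done
  return true;
}
```

With kafkajs this would run inside `eachMessage` with `autoCommit: false`. Note that the check and the mark are two separate steps, which matters for the critique in the next answer.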
Gonna try to create an example repo for this, hope it helps
Exactly-once is a very misunderstood topic. When Confluent says that Kafka supports exactly-once semantics, they mean it in the sense that the observable outcome of processing a topic is the same whether a message has been consumed once or several times, where the observable outcome is the output topic. This applies strictly within the context of a stream that consumes from one topic and produces to another.
It does not mean that the message is only ever seen once. It just means that if someone is consuming the output topic, the result will be the same whether the stream processor processed the input message once or several times. This can be achieved using a transactional producer. https://dzone.com/articles/interpreting-kafkas-exactly-once-semantics
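To make the transactional-producer approach concrete, here is a sketch of the consume-transform-produce loop using the kafkajs transactions API. Broker addresses, topic names, the group id, and the `transform` helper are all placeholder assumptions; this is not runnable without a cluster.

```javascript
// Sketch: exactly-once from input-topic to output-topic via a
// transactional producer. All names are placeholders.
async function run() {
  const { Kafka } = require('kafkajs'); // loaded lazily: sketch only
  const kafka = new Kafka({ clientId: 'eos-example', brokers: ['localhost:9092'] });
  const consumer = kafka.consumer({ groupId: 'eos-group' });
  // transactionalId + idempotent enable exactly-once delivery into the
  // output topic.
  const producer = kafka.producer({
    transactionalId: 'eos-example-tx',
    maxInFlightRequests: 1,
    idempotent: true,
  });

  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: 'input-topic' });

  await consumer.run({
    autoCommit: false,
    eachMessage: async ({ topic, partition, message }) => {
      const transaction = await producer.transaction();
      try {
        await transaction.send({
          topic: 'output-topic',
          messages: [{ value: transform(message.value) }],
        });
        // Commit the consumed offset as part of the same transaction.
        await transaction.sendOffsets({
          consumerGroupId: 'eos-group',
          topics: [{ topic, partitions: [{ partition, offset: nextOffset(message.offset) }] }],
        });
        await transaction.commit();
      } catch (e) {
        await transaction.abort();
        throw e;
      }
    },
  });
}

// The committed offset is the *next* offset to read, hence +1.
function nextOffset(offset) {
  return (Number(offset) + 1).toString();
}

function transform(value) {
  return value; // placeholder for real processing
}
```

If processing fails, the transaction is aborted and neither the output message nor the offset commit becomes visible to downstream consumers reading with `read_committed` isolation.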
What @toledompm describes is not exactly-once, for several reasons. First, what if the group is rebalancing and the partition is reassigned to another consumer? Then you potentially have two different consumers processing the same message at the same time. If the first consumer hasn’t finished processing yet, `isProcessed` will be false for both and they will both continue to process the message. Using something like advisory locks could prevent this case: whoever comes first takes the lock, checks whether the message has been processed, processes it, and finally writes to the DB before releasing the lock. However, what if writing to the DB fails? You can’t “unprocess” the message, so it’s going to be at-least-once regardless.

Going back to the “observable outcome” interpretation of exactly-once, you can indeed achieve this with a transactional database as well, as long as you store the offsets together with the “result” of the operation. For example:
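(The example originally posted here is missing. A sketch of the idea, with a plain object simulating the database and all table/function names invented: the result of processing and the next offset are written in what would be a single SQL transaction, and on restart the consumer resumes from the offset stored in the DB via `consumer.seek()`.)

```javascript
// Sketch: result and consumer offset committed in one SQL transaction,
// simulated with a plain object. In Postgres this would be roughly:
//   BEGIN;
//   INSERT INTO results ...;
//   UPDATE offsets SET next = $1 WHERE topic = $2 AND partition = $3;
//   COMMIT;
function makeDb() {
  return { results: [], offsets: {} }; // one offsets "row" per topic-partition
}

function offsetKey(topic, partition) {
  return `${topic}:${partition}`;
}

// Where to resume after a restart: the offset stored in the DB,
// which would be passed to consumer.seek() in kafkajs.
function resumeOffset(db, topic, partition) {
  return db.offsets[offsetKey(topic, partition)] ?? 0;
}

// "Atomically" (synchronously, in this simulation) store the result of
// processing together with the next offset to consume.
function handleMessage(db, { topic, partition, offset, value }) {
  if (offset < resumeOffset(db, topic, partition)) return; // stale redelivery
  db.results.push(value.toUpperCase()); // the "result" of processing
  db.offsets[offsetKey(topic, partition)] = offset + 1;
}
```

Because the result and the offset commit either both happen or neither does, replaying a message after a crash changes nothing observable in the database, which is the exactly-once guarantee the kafkajs docs are describing.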
The docs should probably be amended, since they might give the wrong idea about when this is useful. The Kafka docs describe it as well.