Messages dropping on the floor with transactional producer

Description

If connectivity to the cluster is lost after a call to CommitTransaction, subsequent transactions will “succeed” although there is no connectivity to the broker.

I don’t think this is the intended behaviour. My understanding was that delivery reports can be ignored with a transactional producer (see https://github.com/edenhill/librdkafka/blob/5fa114ccab90b0a7640b2621bf3e88314d731b84/examples/transactions.c#L102-L109).

In the debug logs (see below), the "No partitions registered: not sending EndTxn" line made me think of this change: https://github.com/edenhill/librdkafka/pull/3271, but I haven’t investigated further.

Killing connectivity at other points behaves fine (e.g. before committing).

Apologies in advance if there is something I haven’t understood.

How to reproduce

This reproduces consistently for me (100% of the time).

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Confluent.Kafka;

namespace TestProducer
{
    public class Producer
    {
        private const string MessageTopic = "important-numbers-5";
        private readonly IEnumerable<int> _numbersToSend = Enumerable.Range(0, Int32.MaxValue);
        private readonly IProducer<byte[], string> _kafkaProducer;

        public Producer()
        {
            var producerConfig = new ProducerConfig
            {
                BootstrapServers = "kafka:9092",
                SecurityProtocol = SecurityProtocol.Plaintext,
                SaslMechanism = SaslMechanism.Plain,
                TransactionalId =  "txd-id-54321",
                EnableIdempotence = true,
                TransactionTimeoutMs = 10 * 1000,
                MaxInFlight = 1,
                LingerMs = 2,
                MessageSendMaxRetries = Int32.MaxValue,
                QueueBufferingMaxKbytes = 100000,
                CompressionType = CompressionType.None,
                Debug = "broker,protocol,msg,eos",
                Acks = Acks.All,
                BrokerAddressFamily = BrokerAddressFamily.V4
            };

            void LogHandler(IProducer<byte[], string> producer, LogMessage logMessage) =>
                Console.WriteLine(logMessage.Message);

            _kafkaProducer = new ProducerBuilder<byte[], string>(producerConfig)
                .SetLogHandler(LogHandler)
                .Build();
        }

        public void Run()
        {
            _kafkaProducer.InitTransactions(TimeSpan.FromSeconds(10));
            _ = Task.Run(ProduceLoop);
        }

        private void ProduceLoop()
        {
            foreach (var msg in _numbersToSend)
            {
                Console.WriteLine($"Going to send: {msg}");
                _kafkaProducer.BeginTransaction();
                _kafkaProducer.Produce(MessageTopic, new Message<byte[], string>
                {
                    Key = null,
                    Value = msg.ToString(),
                });
                _kafkaProducer.CommitTransaction();
                Console.WriteLine("Successfully committed"); // <--- Here
                // 1. Breakpoint on the above line
                // 2. Kill connection to kafka
                // 3. Resume + remove breakpoint
                // 4. Watch subsequent messages being dropped on the floor every transaction.timeout.ms
            }
        }
    }
}
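As a rough way to observe the drop from the application side (a sketch for detection, not a fix), a delivery handler can be passed to Produce and the failures counted. DeliveryTracker, FailedCount and TrackedSend are hypothetical names introduced here, not part of Confluent.Kafka. Whether the handler has already run by the time CommitTransaction returns is an assumption that has not been verified, so the count is best checked a little later as well.

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

// Hypothetical helper (not part of Confluent.Kafka): counts failed delivery reports.
public sealed class DeliveryTracker<TKey, TValue>
{
    private int _failed;

    // Number of delivery reports seen so far that carried an error.
    public int FailedCount => Volatile.Read(ref _failed);

    // Matches the Action<DeliveryReport<TKey, TValue>> parameter of IProducer.Produce.
    public void Handle(DeliveryReport<TKey, TValue> report)
    {
        if (!report.Error.IsError) return;
        Interlocked.Increment(ref _failed);
        Console.WriteLine($"Delivery failed: {report.Error.Code}");
    }
}

public static class TrackedSend
{
    // Sends one message in its own transaction and reports failures seen so far.
    // Assumes InitTransactions has already been called on the producer.
    public static bool SendOne(
        IProducer<byte[], string> producer,
        string topic,
        string value,
        DeliveryTracker<byte[], string> tracker)
    {
        int failedBefore = tracker.FailedCount;

        producer.BeginTransaction();
        producer.Produce(topic, new Message<byte[], string> { Value = value }, tracker.Handle);
        producer.CommitTransaction();

        // Assumption: a timed-out message produces an error delivery report.
        // The handler may run after this point, so a false negative is possible.
        return tracker.FailedCount == failedBefore;
    }
}
```

In the loop above, the BeginTransaction/Produce/CommitTransaction sequence could be swapped for TrackedSend.SendOne(_kafkaProducer, MessageTopic, msg.ToString(), tracker), logging whenever it returns false. This only surfaces the problem; it cannot undo a transaction that was already reported as committed.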

The way I have been simulating connectivity issues is with socat:

---
version: '3.4'

services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"

  kafka:
    image: wurstmeister/kafka:2.13-2.7.0
    environment:
      KAFKA_BROKER_ID: "991"
      KAFKA_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      LOG4J_LOGGER_KAFKA: DEBUG
      LOG4J_LOGGER_ORG_APACHE_KAFKA: DEBUG

  socat:
    image: alpine/socat:latest
    ports:
      - "9092:9092"
    command: tcp-listen:9092,fork,reuseaddr tcp:kafka:9092

I then run docker-compose -f docker-compose.kafka.yml kill socat (with 127.0.0.1 kafka added to my hosts file).

This is the output from kafkacat -f '%o %s'; the gap you can see is where Kafka was down. [image]

The librdkafka logs look like this when “successfully” committing:

[thrd:main]: Cluster connection already in progress: refresh unavailable topics
[thrd:main]: Not selecting any broker for cluster connection: still suppressed for 49ms: no cluster connection
[thrd:main]: Cluster connection already in progress: acquire ProducerID
[thrd:kafka:9092/bootstrap]: kafka:9092/991: Connect to ipv4#127.0.0.1:9092 failed: Unknown error (after 2053ms in state CONNECT) (_TRANSPORT): identical to last error: error log suppressed
[thrd:kafka:9092/bootstrap]: kafka:9092/991: Broker changed state CONNECT -> DOWN
[thrd:kafka:9092/bootstrap]: kafka:9092/991: Broker changed state DOWN -> INIT
[thrd:main]: No brokers available for Transactions (2 broker(s) known)
[thrd:main]: Unable to query for transaction coordinator: Coordinator query timer: No brokers available for Transactions (2 broker(s) known)
[thrd:main]: kafka:9092/991: Selected for cluster connection: acquire ProducerID (broker has 5 connection attempt(s))
[thrd:main]: No brokers available for Transactions (2 broker(s) known)
[thrd:main]: Unable to query for transaction coordinator: Coordinator query timer: No brokers available for Transactions (2 broker(s) known)
[thrd:kafka:9092/bootstrap]: kafka:9092/991: Received CONNECT op
[thrd:kafka:9092/bootstrap]: kafka:9092/991: Broker changed state INIT -> TRY_CONNECT
[thrd:kafka:9092/bootstrap]: kafka:9092/991: broker in state TRY_CONNECT connecting
[thrd:kafka:9092/bootstrap]: kafka:9092/991: Broker changed state TRY_CONNECT -> CONNECT
[thrd:kafka:9092/bootstrap]: kafka:9092/991: Connecting to ipv4#127.0.0.1:9092 (plaintext) with socket 2056
[thrd:main]: Cluster connection already in progress: refresh unavailable topics
[thrd:main]: Not selecting any broker for cluster connection: still suppressed for 49ms: no cluster connection
[thrd:main]: Cluster connection already in progress: acquire ProducerID
[thrd:main]: No brokers available for Transactions (2 broker(s) known)
[thrd:main]: Unable to query for transaction coordinator: Coordinator query timer: No brokers available for Transactions (2 broker(s) known)
[thrd:kafka:9092/bootstrap]: kafka:9092/991: important-numbers-5 [0]: timed out 0+1 message(s) (MsgId 312..312): message.timeout.ms exceeded
[thrd:kafka:9092/bootstrap]: Beginning partition drain for PID{Id:1,Epoch:12} reset for 0 partition(s) with in-flight requests: 1 message(s) timed out on important-numbers-5 [0]
[thrd:kafka:9092/bootstrap]: Idempotent producer state change Assigned -> DrainReset
[thrd:kafka:9092/bootstrap]: All partitions drained
[thrd:kafka:9092/bootstrap]: Idempotent producer state change DrainReset -> RequestPID
[thrd:kafka:9092/bootstrap]: Starting PID FSM timer (fire immediately): Drain done
[thrd:main]: Idempotent producer state change RequestPID -> WaitTransport
[thrd:main]: Starting PID FSM timer: No broker available
[thrd:main]: Cluster connection already in progress: acquire ProducerID
[thrd:main]: No brokers available for Transactions (2 broker(s) known)
[thrd:main]: Unable to query for transaction coordinator: Coordinator query timer: No brokers available for Transactions (2 broker(s) known)
[thrd:app]: Transaction commit message flush complete
[thrd:app]: Transactional API called: commit_transaction
[thrd:main]: No partitions registered: not sending EndTxn
[thrd:main]: Transaction state change BeginCommit -> CommittingTransaction
[thrd:main]: Transaction successfully committed
[thrd:main]: Transaction state change CommittingTransaction -> Ready

Checklist

Please provide the following information:

  • A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • Confluent.Kafka nuget version (1.6.2, librdkafka 1.6.1)
  • Apache Kafka version (2.7.0)
  • Client configuration (included in code above)
  • Operating system (Windows)
  • Provide logs (see above).
  • Provide broker log excerpts (I can get these but this happens while the broker connectivity is down)
  • Critical issue (We don’t have it in prod yet)

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

edenhill commented, Mar 19, 2021 (2 reactions)

Thanks for a great report! Will investigate next week

will118 commented, Mar 30, 2021 (0 reactions)

Yeah, some of them are indeed unnecessary - we are using MaxInFlight = 5 in our app.

Thanks for looking into the issue; what you’ve said makes sense based on my limited familiarity.

Looking forward to the next release!
