Timeout Issues with KafkaEO & Transactions

Description

Since introducing Kafka Exactly Once semantics and Transactions into our platform, we have seen a sharp rise in errors when using the Transactional API. The most common errors we see are:

  • Failed to initialise Producer ID: Local: Timed out' with Exception
  • Operation not valid in state WaitPID, at Confluent.Kafka.Impl.SafeKafkaHandle.BeginTransaction() at Confluent.Kafka.Producer`2.BeginTransaction()
  • Operation not valid in state AbortableError, at Confluent.Kafka.Impl.SafeKafkaHandle.CommitTransaction(Int32 millisecondsTimeout) at Confluent.Kafka.Producer`2.CommitTransaction(TimeSpan timeout)

All of these seem to indicate a long delay in the Broker/Transaction Coordinator responding to the Producer’s request to Begin a Transaction.

We had several weeks of testing with few to no occurrences of the above, but since going Live (just over 2 weeks ago) the numbers have started to rise, and the errors are now present in almost all of our Microservices all of the time. As part of our EO implementation we have logic to Abort the Transaction and retry if such Exceptions are caught, but in recent examples even that is not working as we would expect. Could this be because of a disconnect between the Producer and Broker?

From our logs we can see that in some scenarios it takes more than 60 seconds just to Init & Begin our Transactions, something that was previously a quick and simple step. The Microservices that use the EO & Transactional API have slowly deteriorated to the point where almost all of them report the above errors and seem unable to recover.

How to reproduce

We Init & Begin Transaction as normal, generally passing a transactionTimeout of 60,000ms or more. Timeouts are happening when calling Begin & Commit Transaction.

// Init transactions
_kafkaProducer.InitTransactions(TimeSpan.FromMilliseconds(transactionTimeout));
_kafkaProducer.BeginTransaction();

// Commit transaction
_kafkaProducer.CommitTransaction(TimeSpan.FromMilliseconds(transactionTimeout));
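The abort-and-retry logic mentioned in the description is not shown in the issue. As a rough illustration only, here is a minimal sketch of how such handling could look with the Confluent.Kafka transactional API. The class and method names (TransactionalSender, SendInTransaction, CommitWithRetry), the string key/value types and the maxAttempts parameter are hypothetical and not taken from the reporter's code. The sketch assumes InitTransactions has already been called once on the producer.

```csharp
using System;
using System.Collections.Generic;
using Confluent.Kafka;

// Hypothetical helper, not the reporter's implementation.
public static class TransactionalSender
{
    // Sends one batch inside a single transaction. If the transaction enters an
    // abortable error state, it is aborted and the whole batch is retried.
    // Assumes producer.InitTransactions(...) was already called once.
    public static void SendInTransaction(
        IProducer<string, string> producer,
        string topic,
        IReadOnlyList<Message<string, string>> batch,
        TimeSpan operationTimeout,
        int maxAttempts = 3)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                producer.BeginTransaction();
                foreach (var message in batch)
                {
                    producer.Produce(topic, message);
                }
                CommitWithRetry(producer, operationTimeout, maxAttempts);
                return;
            }
            catch (KafkaTxnRequiresAbortException)
            {
                // The current transaction cannot be committed: abort it, then
                // either start over or give up once the attempts are used up.
                producer.AbortTransaction(operationTimeout);
                if (attempt >= maxAttempts)
                {
                    throw;
                }
            }
            // Any other KafkaException propagates to the caller. In that case the
            // caller would typically dispose the producer and create a new one.
        }
    }

    // Retries the commit call itself while the client reports a retriable error.
    private static void CommitWithRetry(
        IProducer<string, string> producer, TimeSpan timeout, int maxAttempts)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                producer.CommitTransaction(timeout);
                return;
            }
            catch (KafkaRetriableException) when (attempt < maxAttempts)
            {
                // Retriable: call CommitTransaction again on the same transaction.
            }
        }
    }
}
```

In this sketch the timeout passed to CommitTransaction and AbortTransaction only bounds how long the call blocks. The transaction timeout enforced by the coordinator is a separate producer setting (TransactionTimeoutMs on ProducerConfig).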

Checklist

Please provide the following information:

  • Confluent.Kafka 1.4.4

Below are some logs taken from one of our Microservices, which seems to indicate the Broker being down:

%7|1597134906.825|STATE|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state UP -> DOWN
%7|1597134906.825|STATE|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Broker changed state UP -> DOWN
%7|1597134906.825|STATE|rdkafka#producer-4| [thrd::0/internal]: :0/internal: Broker changed state INIT -> DOWN
%7|1597134906.825|BRKTERM|rdkafka#producer-4| [thrd::0/internal]: :0/internal: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597134906.825|BRKTERM|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597134906.825|BRKTERM|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator/5: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597134906.825|TERMINATE|rdkafka#producer-4| [thrd::0/internal]: :0/internal: Handle is terminating in state DOWN: 1 refcnts (0x256c870), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597134906.825|BROKERFAIL|rdkafka#producer-4| [thrd::0/internal]: :0/internal: failed: err: Local: Broker handle destroyed: (errno: Success)
%7|1597134906.825|TERMINATE|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator/5: Handle is terminating in state DOWN: 1 refcnts (0x7fa598002b70), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597134906.825|TERMINATE|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Handle is terminating in state DOWN: 1 refcnts (0x257a600), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597134906.825|BROKERFAIL|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator: failed: err: Local: Broker handle destroyed: (errno: Success)
%7|1597134906.825|BROKERFAIL|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: failed: err: Local: Broker handle destroyed: (errno: Success)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments:23 (12 by maintainers)

Top GitHub Comments

1 reaction
mhowlett commented, Aug 12, 2020

added a note to the support ticket.

you’ve found the backdoor to the engineers here 😃. not 100% reliable though.

0 reactions
mhowlett commented, Oct 12, 2022

going through cleaning up issues and i’m closing this because there have been a lot of transaction related fixes post 1.7.0. And there is another (bug in KIP-360 impl.) coming in v1.10. suspect this relates to something that has now been fixed.

Top Results From Across the Web

CommitTransaction doesn't fail when the transaction times ...
If there is a long pause between the last call to "Produce" and the call to "CommitTransaction", the transaction is deemed to have...

KIP-447: Producer scalability for exactly once semantics
Note that the current default transaction.timeout is set to one minute, which is too long for Kafka Streams EOS use cases. Considering the...

Why am I getting InvalidProducerEpochException when no ...
For newer version of Kafka Streams, default transaction.timeout.ms is 10 sec, and thus, a transaction could time out before a commit happens.

Kafka Transactions: Part 1: Exactly-Once Messaging
This is because failure scenarios and time outs naturally mean that messages are redelivered to ensure messages are not lost and are ...

Best Practices for Using Kafka Sources/Sinks in Flink Jobs
1. Configure Applicable Kafka Transaction Timeouts With End-To-End Exactly-Once Delivery · transaction.max.timeout.ms at the Kafka broker. The ...
