Timeout Issues with KafkaEO & Transactions

Description

Since introducing Kafka Exactly Once semantics and Transactions into our platform we have recently seen a sharp rise in errors when trying to use the Transactional API. The most common errors we see are:

Failed to initialise Producer ID: Local: Timed out' with Exception
Operation not valid in state WaitPID, at Confluent.Kafka.Impl.SafeKafkaHandle.BeginTransaction() at Confluent.Kafka.Producer`2.BeginTransaction()
Operation not valid in state AbortableError, at Confluent.Kafka.Impl.SafeKafkaHandle.CommitTransaction(Int32 millisecondsTimeout) at Confluent.Kafka.Producer`2.CommitTransaction(TimeSpan timeout)

All of these seem to indicate a long delay in the Broker/Transaction Coordinator responding to the Producer’s request to Begin a Transaction.

During several weeks of testing we saw little to no occurrence of the above, but since going Live (just over 2 weeks ago) the numbers have started to rise, and the errors are now present in almost all of our Microservices all of the time. As part of our EO implementation we have logic to Abort the Transaction and retry if such Exceptions are caught, but in recent examples even that does not seem to work as we would expect. Possibly because of a disconnect between the Producer and Broker?
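The issue does not show the abort-and-retry logic itself. As a rough sketch of what such handling can look like with Confluent.Kafka's transactional API, the helper below aborts and restarts the transaction on KafkaTxnRequiresAbortException and retries only the commit call on KafkaRetriableException. The class and method names, the single-message payload, and the attempt limits are assumptions made for illustration; this is not the reporter's code.

```csharp
using System;
using Confluent.Kafka;

public static class TransactionalSend
{
    // Hypothetical helper. Assumes InitTransactions() has already succeeded
    // once for this producer instance.
    public static void ProduceInTransaction(
        IProducer<Null, string> producer,
        string topic,
        string value,
        TimeSpan timeout,
        int maxAttempts = 3)
    {
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            producer.BeginTransaction();
            try
            {
                producer.Produce(topic, new Message<Null, string> { Value = value });
                CommitWithRetry(producer, timeout);
                return;
            }
            catch (KafkaTxnRequiresAbortException)
            {
                // The current transaction cannot be committed: abort it,
                // then loop around and begin a fresh one.
                producer.AbortTransaction(timeout);
            }
            // Any other KafkaException (including fatal errors) propagates
            // to the caller, which would typically recreate the producer.
        }
        throw new InvalidOperationException(
            $"Transaction still failing after {maxAttempts} attempts.");
    }

    // A retriable error means the same call may be attempted again
    // without aborting the transaction.
    private static void CommitWithRetry(
        IProducer<Null, string> producer, TimeSpan timeout, int maxRetries = 3)
    {
        for (var retry = 0; ; retry++)
        {
            try
            {
                producer.CommitTransaction(timeout);
                return;
            }
            catch (KafkaRetriableException) when (retry < maxRetries)
            {
                // Fall through and retry the commit.
            }
        }
    }
}
```

AbortTransaction can itself throw, for example when the coordinator is unreachable; that case is deliberately left unhandled here to keep the sketch short.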

From looking at our logs we can see in some scenarios it’s taking > 60 seconds to just Init & Begin our Transactions, something that was previously a quick and simple step. This has slowly deteriorated our Microservices that are using the EO & Transactional API to the point where almost all are reporting the above errors and seemingly unable to recover.

How to reproduce

We Init & Begin Transaction as normal, generally passing a transactionTimeout of 60,000ms or more. Timeouts are happening when calling Begin & Commit Transaction.

// Init transactions
_kafkaProducer.InitTransactions(TimeSpan.FromMilliseconds(transactionTimeout));
_kafkaProducer.BeginTransaction();

// Commit transaction
_kafkaProducer.CommitTransaction(TimeSpan.FromMilliseconds(transactionTimeout));
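For context, the TimeSpan passed to InitTransactions and CommitTransaction limits how long that call may block, whereas the transaction.timeout.ms setting (TransactionTimeoutMs on ProducerConfig) is the separate limit the transaction coordinator applies to an open transaction. The sketch below shows one way the producer might be created so the two values are kept apart. It assumes InitTransactions runs once per producer instance, and the class name, parameters, and timeout values are placeholders rather than anything taken from the issue.

```csharp
using System;
using Confluent.Kafka;

public static class TransactionalProducerSetup
{
    // Hypothetical factory. Security settings (for example SASL_SSL
    // credentials for a hosted cluster) are omitted for brevity.
    public static IProducer<Null, string> Create(
        string bootstrapServers, string transactionalId)
    {
        var config = new ProducerConfig
        {
            BootstrapServers = bootstrapServers,
            TransactionalId = transactionalId,
            // transaction.timeout.ms: the coordinator's limit on an open
            // transaction (value mirrors the 60,000ms mentioned above).
            TransactionTimeoutMs = 60000,
        };

        var producer = new ProducerBuilder<Null, string>(config).Build();
        try
        {
            // Call timeout: how long this call may block while the
            // producer ID is acquired. Assumed value.
            producer.InitTransactions(TimeSpan.FromSeconds(30));
            return producer;
        }
        catch
        {
            // Do not leak the native handle if initialisation fails.
            producer.Dispose();
            throw;
        }
    }
}
```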

Checklist

Confluent.Kafka version: 1.4.4

Below are some logs taken from one of our Microservices, which seem to indicate the Broker being down:

%7|1597134906.825|STATE|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state UP -> DOWN
%7|1597134906.825|STATE|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Broker changed state UP -> DOWN
%7|1597134906.825|STATE|rdkafka#producer-4| [thrd::0/internal]: :0/internal: Broker changed state INIT -> DOWN
%7|1597134906.825|BRKTERM|rdkafka#producer-4| [thrd::0/internal]: :0/internal: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597134906.825|BRKTERM|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597134906.825|BRKTERM|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator/5: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597134906.825|TERMINATE|rdkafka#producer-4| [thrd::0/internal]: :0/internal: Handle is terminating in state DOWN: 1 refcnts (0x256c870), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597134906.825|BROKERFAIL|rdkafka#producer-4| [thrd::0/internal]: :0/internal: failed: err: Local: Broker handle destroyed: (errno: Success)
%7|1597134906.825|TERMINATE|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator/5: Handle is terminating in state DOWN: 1 refcnts (0x7fa598002b70), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597134906.825|TERMINATE|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Handle is terminating in state DOWN: 1 refcnts (0x257a600), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597134906.825|BROKERFAIL|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator: failed: err: Local: Broker handle destroyed: (errno: Success)
%7|1597134906.825|BROKERFAIL|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: failed: err: Local: Broker handle destroyed: (errno: Success))
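Each entry in the debug output above appears to follow the shape %LEVEL|TIMESTAMP|FACILITY|CLIENT| [thrd:THREAD]: MESSAGE. When sifting through a large volume of such logs, a small parser can help pull out only the broker state changes. The sketch below is one way to do that; the field names and the RdKafkaLog and DebugLine types are labels invented for this example, inferred from the sample entries rather than from any official format description.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// One parsed debug entry. Field names are assumptions based on the sample.
public sealed class DebugLine
{
    public int Level { get; set; }
    public string Timestamp { get; set; }   // kept as text, e.g. "1597134906.825"
    public string Facility { get; set; }    // e.g. STATE, BRKTERM, BROKERFAIL
    public string Client { get; set; }      // e.g. rdkafka#producer-4
    public string Thread { get; set; }      // e.g. TxnCoordinator
    public string Message { get; set; }
}

public static class RdKafkaLog
{
    private static readonly Regex Pattern = new Regex(
        @"^%(\d)\|(\d+\.\d+)\|([A-Z]+)\|([^|]+)\| \[thrd:(.+?)\]: (.*)$");

    // Returns null when the line does not match the expected shape.
    public static DebugLine Parse(string line)
    {
        var m = Pattern.Match(line.Trim());
        if (!m.Success) return null;
        return new DebugLine
        {
            Level = int.Parse(m.Groups[1].Value),
            Timestamp = m.Groups[2].Value,
            Facility = m.Groups[3].Value,
            Client = m.Groups[4].Value,
            Thread = m.Groups[5].Value,
            Message = m.Groups[6].Value,
        };
    }

    // Keeps only the entries that report a broker state transition.
    public static IEnumerable<DebugLine> StateChanges(IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            var parsed = Parse(line);
            if (parsed != null && parsed.Facility == "STATE")
                yield return parsed;
        }
    }
}
```

Feeding the entries above through StateChanges would leave just the three STATE lines, which makes it easier to see when the TxnCoordinator and bootstrap connections went from UP to DOWN.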

Issue Analytics

Created: 3 years ago
Reactions: 2
Comments: 23 (12 by maintainers)

Top GitHub Comments

1 reaction

mhowlett commented, Aug 12, 2020

added a note to the support ticket.

you’ve found the backdoor to the engineers here 😃. not 100% reliable though.

0 reactions

mhowlett commented, Oct 12, 2022

going through cleaning up issues and i’m closing this because there have been a lot of transaction related fixes post 1.7.0. And there is another (bug in KIP-360 impl.) coming in v1.10. suspect this relates to something that has now been fixed.