Produced messages with acks=1, no errors in logs, yet messages didn't appear in Kafka
At my day job, we have a service with a KafkaProducer instance that sends a low volume of messages (a few thousand a week) to a Kafka topic, but all messages are quite valuable and must be received.
The producer is configured with acks=1; the topic has 3 replicas, with min.insync.replicas=2. The cluster is currently on Kafka version 0.10.0.
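For reference, a topic with this shape could be created like so on a 0.10-era cluster (the ZooKeeper address, topic name, and partition count below are placeholders, not taken from the issue):

```shell
# Create a 3-replica topic with min.insync.replicas=2.
# Note: min.insync.replicas is only enforced for acks=all producers;
# with acks=1 the leader alone acknowledges the write.
kafka-topics.sh --create \
  --zookeeper zk1:2181 \
  --topic valuable-events \
  --partitions 3 \
  --replication-factor 3 \
  --config min.insync.replicas=2
```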
Since it’s a low-volume, high-value producer, the service explicitly logs when it produces a message, and it forces messages to be sent to the broker by calling producer.flush() immediately after producer.send(). I am still double-checking with the service owner, but IIRC the service immediately checks the result of the message future to make sure the future has completed and no errors were reported. If there are any errors, the service logs them. So because we’re using acks=1, we know that if the service reports a message was produced with no errors, the message should be present on the broker.
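The produce-and-check pattern described above looks roughly like this (a sketch against the kafka-python API; the broker address, topic name, and payload are placeholders, and a running broker is required):

```python
import logging

from kafka import KafkaProducer
from kafka.errors import KafkaError

log = logging.getLogger(__name__)

producer = KafkaProducer(
    bootstrap_servers="broker:9092",  # placeholder address
    acks=1,  # wait only for the partition leader's acknowledgement
)

future = producer.send("valuable-events", b"important payload")
producer.flush()  # block until buffered records are sent and futures resolved

try:
    # After flush() the future has resolved, so get() returns immediately:
    # RecordMetadata on success, or it raises the error the send hit.
    metadata = future.get(timeout=10)
    log.info("produced to %s[%d] at offset %d",
             metadata.topic, metadata.partition, metadata.offset)
except KafkaError:
    log.exception("produce failed")
```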
However, twice now in production we’ve had problems that we traced back to a message not being in the Kafka topic. The service logs say the message was successfully produced with no reported errors, yet when we use a console consumer to consume everything in the topic, these messages are not present. The service has been running for weeks, so this isn’t an issue tied to producer startup/shutdown. And the Kafka cluster has been stable, with no unclean leader elections.
This issue is mostly a placeholder for me, as I intend to dig into this more deeply but haven’t had time yet. But I also wanted to double-check with @dpkp / @tvoinarovskyi that I’m not overlooking any valid scenarios where this behavior is expected.
My hunch at this point is that the messages were lost due to network issues and there’s a bug somewhere in kafka-python related to acks=1 handling.
. Since the network issues are infrequent, any bug like that would only rarely be encountered, and for most of our other topics the message volume is so high that if we lost a few messages we probably wouldn’t realize it. So my first steps will be simulating some network failures and seeing what happens.
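One low-tech way to simulate such a failure (my own assumption about methodology, not something from the issue) is to drop the producer host's outbound traffic to the broker port with iptables while producing, and watch what the client reports:

```shell
# Drop outbound traffic to the broker (9092 is Kafka's default port).
sudo iptables -A OUTPUT -p tcp --dport 9092 -j DROP

# ... run the producer and observe whether send()/flush() surface errors ...

# Restore connectivity by removing the rule.
sudo iptables -D OUTPUT -p tcp --dport 9092 -j DROP
```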
Issue Analytics
- Created 6 years ago
- Comments: 5
@tvoinarovskyi I’m afraid I’m slightly confused, as this explanation is slightly different from how I thought Producer.flush() worked… Does Producer.flush() block until the messages were actually sent off the box?
At what layer are you saying this retry happens? If we call Producer.flush() as part of our shutdown code right before we call Producer.close(), and there’s a network error, will the producer block on the flush() and continue to retry the error? Or does flush() only guarantee the initial send, so that if we follow it with a close(), the retry on network error will never happen?
I’m happy to look this up in the source too, but if you know offhand I’d appreciate any clarification.
For some reason I’d been interpreting this comment to mean that flush() only guaranteed the messages were sent off the box and that the associated futures could still be in an unresolved state, but that is incorrect; the futures will have resolved: https://github.com/dpkp/kafka-python/blob/eed59ba3b3c8800859572db046f36b5d8bd66487/kafka/producer/record_accumulator.py#L520-L536
This is also mentioned in the flush() docstring: https://github.com/dpkp/kafka-python/blob/8602389bbee5e99296a73700b76bd3e44f0fcf3b/kafka/producer/kafka.py#L612-L632
Just a quick point of clarification for anyone else who stumbled across this.
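To make the distinction concrete, here is a minimal self-contained simulation (a toy stand-in, not kafka-python itself) of what "flush() resolves the futures" means: before flush() a future from send() may still be pending, but once flush() returns, every future is done and can be checked synchronously for success or failure.

```python
# Toy classes mirroring the relevant parts of kafka-python's Future API
# (is_done, succeeded(), exception). Purely illustrative.
class ToyFuture:
    def __init__(self):
        self.is_done = False
        self.value = None
        self.exception = None

    def success(self, value):
        self.is_done, self.value = True, value
        return self

    def failure(self, exc):
        self.is_done, self.exception = True, exc
        return self

    def succeeded(self):
        return self.is_done and self.exception is None


class ToyProducer:
    def __init__(self):
        self._pending = []

    def send(self, record):
        # The record is only buffered here; its future is not yet resolved.
        future = ToyFuture()
        self._pending.append((record, future))
        return future

    def flush(self):
        # Like the real flush(), this does not return until every buffered
        # record's future has been resolved, one way or the other.
        for record, future in self._pending:
            if record == "poison":
                future.failure(RuntimeError("broker rejected record"))
            else:
                future.success("offset-%d" % len(record))
        self._pending.clear()


producer = ToyProducer()
ok = producer.send("hello")
bad = producer.send("poison")
assert not ok.is_done                # before flush: still unresolved
producer.flush()
assert ok.is_done and bad.is_done    # after flush: all futures resolved
assert ok.succeeded() and not bad.succeeded()
```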
At this point, reproducing this issue will be near impossible, so I’m going to close this.