
Produced messages with acks=1, no errors in logs, yet messages didn't appear in Kafka

See original GitHub issue

At my day job, we have a service with a KafkaProducer instance that sends a low volume of messages (a few thousand a week) to a Kafka topic, but every message is valuable and must be received.

The producer is configured with acks=1, and the topic has 3 replicas with min.insync.replicas=2. The cluster is currently on Kafka version 0.10.0.

Since it’s a low-volume, high-value producer, the service explicitly logs when it produces a message, and it forces messages to be sent to the broker by calling producer.flush() immediately after producer.send(). I am still double-checking with the service owner, but IIRC the service immediately checks the result of the message future to make sure the future has completed and no errors were reported; if there are any errors, the service logs them. So because we’re using acks=1, whenever the service reports that a message was produced with no errors, that message should be present on the broker.
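
For reference, here is a minimal sketch of the produce / flush / check pattern described above, using kafka-python. This is not the actual service code; the broker address, topic name, and timeout are placeholder assumptions.

```python
# Minimal sketch of the produce/flush/check pattern described above.
# Broker address, topic name, and timeout are placeholders, not the real service config.
from kafka import KafkaProducer
from kafka.errors import KafkaError

producer = KafkaProducer(
    bootstrap_servers="broker:9092",  # placeholder
    acks=1,                           # leader-only acknowledgment, as in the issue
)

def send_critical(message: bytes) -> None:
    future = producer.send("critical-topic", value=message)
    producer.flush()  # push the batch out immediately
    try:
        metadata = future.get(timeout=10)  # raises a KafkaError if the send failed
        print("produced to %s[%d] @ offset %d"
              % (metadata.topic, metadata.partition, metadata.offset))
    except KafkaError as exc:
        print("produce failed: %r" % exc)
        raise
```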

However, twice now in production we’ve had problems that we traced back to the message not being in the kafka topic. The service logs say the message was successfully produced with no reported errors, yet when we use a console consumer to consume everything in the topic, these messages are not present. The service has been running for weeks, so this isn’t an issue tied to producer startup/shutdown. And the Kafka cluster has been stable, with no unclean leader elections.

This issue is mostly a placeholder for me, as I intend to dig into this more deeply but haven’t had time yet. But I also wanted to double-check with @dpkp / @tvoinarovskyi that I’m not overlooking any valid scenarios where this behavior is expected.

My hunch at this point is that the messages were lost due to network issues and there’s a bug somewhere in kafka-python related to acks=1 handling. Since the network issues are infrequent, any such bug would only rarely be encountered, and for most of our other topics the message volume is so high that if we lost a few messages we probably wouldn’t notice. So my first step will be simulating some network failures and seeing what happens.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
jeffwidman commented, Oct 9, 2017

@tvoinarovskyi I’m afraid I’m a bit confused, as this explanation differs slightly from how I thought Producer.flush() worked…

Does Producer.flush() block until the messages are actually sent off the box?

The network error can lead to a duplicate, but should not lead to a loss of the message.

At what layer are you saying this retry happens? If we call Producer.flush() as part of our shutdown code right before we call Producer.close(), and there’s a network error, will the producer block on the flush() and continue to retry the error? Or does flush() only guarantee the initial send, and if we follow that with a close(), the retry on network error will never happen?
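
For illustration, a hypothetical version of the shutdown sequence in question (the broker address, topic, and retries value are placeholder assumptions, not the service’s real config):

```python
# Hypothetical shutdown sequence under discussion; broker address, topic,
# and retries value are placeholders. The open question: does flush() wait
# for retries of network errors, or can close() drop a pending retry?
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092", acks=1, retries=3)

producer.send("critical-topic", value=b"final message")
producer.flush()   # blocks until outstanding batches complete; do retries happen here?
producer.close()
```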

I’m happy to look this up in the source too, but if you know offhand I’d appreciate any clarification.

0 reactions
jeffwidman commented, Apr 14, 2019

Producer.flush() is only a signal to finish the batch and send them to Broker, it does not confirm delivery. In case of a network error or a timeout, you have to check the future to get the error. The network error can lead to a duplicate, but should not lead to a loss of the message.

For some reason I’d been interpreting this comment to mean that flush() only guaranteed the messages were sent off the box and that the associated futures could still be in an unresolved state, but that is incorrect; the futures will have resolved by the time flush() returns: https://github.com/dpkp/kafka-python/blob/eed59ba3b3c8800859572db046f36b5d8bd66487/kafka/producer/record_accumulator.py#L520-L536

This is also mentioned in the flush() docstring: https://github.com/dpkp/kafka-python/blob/8602389bbee5e99296a73700b76bd3e44f0fcf3b/kafka/producer/kafka.py#L612-L632
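
As an illustrative sketch of that behavior (placeholder broker and topic, not code from the issue): after flush() returns, the future for each previously sent message has resolved, and delivery is confirmed by inspecting the futures rather than by flush() itself.

```python
# Sketch of checking futures after flush(); broker address and topic are placeholders.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="broker:9092", acks=1)

futures = [producer.send("critical-topic", value=b"msg-%d" % i) for i in range(3)]
producer.flush()  # waits for outstanding batches; the futures below are now resolved

for future in futures:
    assert future.is_done        # resolved once flush() has returned (see accumulator link above)
    if future.failed():
        print("send failed:", future.exception)
    else:
        md = future.value        # RecordMetadata: topic, partition, offset
        print("delivered to %s[%d] @ offset %d" % (md.topic, md.partition, md.offset))
```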

Just a quick point of clarification for anyone else who stumbled across this.

At this point, reproducing this issue would be nearly impossible, so I’m going to close this.

