Acked messages unexpectedly redelivered when others are negatively acked
Describe the bug
We’ve encountered an issue in which acknowledged messages are redelivered one or more times after other messages are negatively acknowledged. This seems to occur when messages are produced in batches, and it happens without any known broker or connection failures.
To Reproduce
I’ve modified the NegativeAcksTest to test for the correct behavior here: https://github.com/apache/pulsar/compare/master...gmethvin:negative-ack-duplicates
As the test demonstrates, in some configurations positively acknowledged messages are redelivered. This is similar to a situation we see in production.
Expected behavior
Only the negatively acknowledged messages should be redelivered. Positively acknowledged messages should not be redelivered, at least not in a typical situation with no failures.
We produce messages in batches, but the APIs and the documentation both suggest that acks and negative acks operate on individual messages. If negative acks operate on whole batches, the APIs and documentation should say so clearly.
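To make the gap between expected and observed behavior concrete, here is a small Python model (not Pulsar code) of what happens if redelivery works per batch rather than per message. It assumes each produced batch is stored as one broker entry, that a message ID is an (entry, index-in-batch) pair, and that a negative ack on any message causes the whole entry to be redelivered. `MessageId` and `redeliver_after_nack` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageId:
    entry_id: int     # assumption: one broker entry per produced batch
    batch_index: int  # position of the message inside its batch


def redeliver_after_nack(batches, nacked):
    """Return the message IDs a consumer would receive again, assuming that a
    negative ack on any message redelivers its entire batch. Individual acks
    on the other messages in that batch are deliberately ignored here, since
    in this model only the batch as a whole is tracked for redelivery."""
    nacked_entries = {mid.entry_id for mid in nacked}
    return [
        MessageId(entry_id, i)
        for entry_id, size in batches.items()
        if entry_id in nacked_entries
        for i in range(size)
    ]


# One batch (entry 7) of 4 messages: 0, 1 and 3 are acked, 2 is negatively acked.
batches = {7: 4}
acked = {MessageId(7, 0), MessageId(7, 1), MessageId(7, 3)}
nacked = {MessageId(7, 2)}

redelivered = redeliver_after_nack(batches, nacked)
duplicates = [m for m in redelivered if m in acked]
print(len(redelivered), len(duplicates))  # 4 3
```

Under these assumptions, nacking one message out of four redelivers all four, and three of them had already been positively acknowledged. This is the kind of duplicate the modified test checks for.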
Top GitHub Comments
Oh, since this is also related to the transaction implementation, Penghui and I are already writing a PIP for it. We will share the PIP soon.
@jerrypeng @sijie @gmethvin I agree that redelivering batches is OK in most cases. Redelivery of unacked/nacked messages is the exception, not the rule, but when it occurs it should behave in a way that makes sense to the user of the API.
It’s not that hard for the client to filter the already-acked messages out of a redelivered batch. The client already tracks which messages in each batch have been acked, so that it knows when it can ack the whole batch back to the broker. It’s just a matter of using that information to filter the messages it receives.
That approach doesn’t cover all failure cases (e.g., if the client restarts), but when the application NACKs a message it would give reasonable behavior. It would also reduce the number of duplicate messages sent to the client when using batch messages.
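A rough sketch of the client-side filtering proposed above, written in Python rather than against the Java client. `BatchAckTracker` and its methods are hypothetical and not part of the Pulsar client API. It assumes the client can see each batch's entry ID and size on every delivery, and that it keeps its per-batch ack state in memory.

```python
class BatchAckTracker:
    """Hypothetical client-side tracker: remembers which messages of each batch
    the application has acked, reports when a whole batch can be acked to the
    broker, and filters already-acked messages out of a redelivered batch."""

    def __init__(self):
        self._batch_size = {}  # entry_id -> number of messages in the batch
        self._acked = {}       # entry_id -> set of acked batch indexes

    def on_batch_received(self, entry_id, size):
        """Register a (re)delivered batch and return the batch indexes to hand
        to the application, skipping messages acked in an earlier delivery."""
        self._batch_size[entry_id] = size
        acked = self._acked.setdefault(entry_id, set())
        return [i for i in range(size) if i not in acked]

    def ack(self, entry_id, batch_index):
        """Record an individual ack. Return True once every message in the
        batch is acked, i.e. when the batch can be acked to the broker."""
        acked = self._acked.setdefault(entry_id, set())
        acked.add(batch_index)
        done = len(acked) == self._batch_size[entry_id]
        if done:
            # Whole batch acked: no further redelivery expected, drop the state.
            del self._acked[entry_id]
            del self._batch_size[entry_id]
        return done


tracker = BatchAckTracker()
first = tracker.on_batch_received(entry_id=7, size=4)   # [0, 1, 2, 3]
for i in (0, 1, 3):
    tracker.ack(7, i)                                   # message 2 is NACKed instead
second = tracker.on_batch_received(entry_id=7, size=4)  # redelivery of the whole batch
done = tracker.ack(7, 2)
print(first, second, done)  # [0, 1, 2, 3] [2] True
```

Only the NACKed message reaches the application on redelivery. Because the state lives only in memory, a restarted consumer starts with an empty tracker and would pass the entire redelivered batch through again, which is the restart case noted above.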
@zzzming and I are happy to work on a PR for this if everyone agrees this is the right approach.