Pub/Sub: buffered messages delivered multiple times ignoring acknowledgment deadline

I noticed a problem where messages buffered by the Pub/Sub client library are delivered multiple times, ignoring the acknowledgment deadline. I have a subscription with the acknowledgment deadline set to 5 minutes, but messages are delivered multiple times to the same subscriber after about 30 seconds.

My code example starts a subscriber client with max_messages set to 1. When two messages are sent to the subscription, the second message is always duplicated. The same behavior also occurs with higher max_messages values. For example, with the default max_messages value of 100, when 150 messages are sent to the subscription, some of them are duplicated if processing the first 100 messages takes more than 30 seconds.

I understand that with a large backlog of small messages, messages can get redelivered (see "Dealing with large backlogs of small messages"). However, shouldn't messages be redelivered only after the ack_deadline is exceeded?

Environment details

  • OS type and version: Ubuntu 18.04.3 LTS
  • Python version and virtual environment information: Python 3.7.3
  • google-cloud-pubsub version: google-cloud-pubsub==1.0.0

Steps to reproduce

  1. Create a subscription with a high ack_deadline.
  2. Start the subscriber (see code examples): python subscriber.py -p PROJECT_ID SUBSCRIPTION_NAME.
  3. Send two messages to the Pub/Sub topic.
  4. The second message is duplicated.

Code example

Example subscriber code here: https://gist.github.com/qvik-olli/0bfd4ace2d06def1675a76fbc20493e5
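
The gist is not reproduced here. As a rough sketch only (not the reporter's actual code), a subscriber along these lines should exhibit the setup described above: max_messages set to 1, a long sleep in the handler, and a check for message IDs that have already been seen. The names make_callback and run_subscriber and the 40-second default (taken from the gap between "sleeping" and "done sleeping" in the logs below) are assumptions made for illustration.

```python
import logging
import threading
import time


def make_callback(seen_ids, lock, sleep_seconds=40.0):
    """Build a message handler that flags repeated message IDs (hypothetical helper)."""

    def callback(message):
        # Record the ID under a lock, since callbacks run on a thread pool.
        with lock:
            duplicate = message.message_id in seen_ids
            seen_ids.add(message.message_id)

        if duplicate:
            logging.error("[%s] Duplicate message!!!", message.message_id)
            message.ack()
            return

        logging.info("[%s] got message with content: %r", message.message_id, message.data)
        logging.info("[%s] sleeping", message.message_id)
        time.sleep(sleep_seconds)  # simulate slow processing
        logging.info("[%s] done sleeping", message.message_id)
        message.ack()

    return callback


def run_subscriber(project_id, subscription_name):
    """Start a streaming pull with max_messages=1 (assumes google-cloud-pubsub is installed)."""
    from google.cloud import pubsub_v1  # imported lazily so the helper above works without it

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(project_id, subscription_name)
    flow_control = pubsub_v1.types.FlowControl(max_messages=1)
    future = subscriber.subscribe(
        path,
        callback=make_callback(set(), threading.Lock()),
        flow_control=flow_control,
    )
    future.result()  # block until the streaming pull fails or is cancelled
```

Tracking seen IDs in memory only detects duplicates within a single process, which is enough for reproducing the report but would not be a durable deduplication strategy.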

Logs

From the subscriber script:

2019-09-19 14:10:37,740 INFO ThreadPoolExecutor-ThreadScheduler_0: [652240118977134] got message with content: b'0'
2019-09-19 14:10:37,740 INFO ThreadPoolExecutor-ThreadScheduler_0: [652240118977134] sleeping
2019-09-19 14:11:17,747 INFO ThreadPoolExecutor-ThreadScheduler_0: [652240118977134] done sleeping
2019-09-19 14:11:17,819 INFO ThreadPoolExecutor-ThreadScheduler_1: [652240118977135] got message with content: b'1'
2019-09-19 14:11:17,819 INFO ThreadPoolExecutor-ThreadScheduler_1: [652240118977135] sleeping
2019-09-19 14:11:57,859 INFO ThreadPoolExecutor-ThreadScheduler_1: [652240118977135] done sleeping
2019-09-19 14:11:57,986 ERROR ThreadPoolExecutor-ThreadScheduler_0: [652240118977135] Duplicate message!!!

With debug logging from the pubsub library:

2019-09-19 14:10:35,175 DEBUG Thread-LeaseMaintainer: The current p99 value is 10 seconds.
2019-09-19 14:10:35,176 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 6.301498 seconds.
2019-09-19 14:10:37,634 DEBUG Thread-ConsumeBidirectionalStream: recved response.
2019-09-19 14:10:37,634 DEBUG Thread-ConsumeBidirectionalStream: Processing 2 received message(s), currenty on hold 0.
2019-09-19 14:10:37,740 DEBUG Thread-ConsumeBidirectionalStream: Sent request(s) over unary RPC.
2019-09-19 14:10:37,740 DEBUG Thread-ConsumeBidirectionalStream: Message backlog over load at 1.00, pausing.
2019-09-19 14:10:37,740 DEBUG Thread-ConsumeBidirectionalStream: Scheduling callbacks for 1 new messages, new total on hold 1.
2019-09-19 14:10:37,740 INFO ThreadPoolExecutor-ThreadScheduler_0: [652240118977134] got message with content: b'0'
2019-09-19 14:10:37,740 INFO ThreadPoolExecutor-ThreadScheduler_0: [652240118977134] sleeping
2019-09-19 14:10:37,740 DEBUG Thread-ConsumeBidirectionalStream: paused, waiting for waking.
2019-09-19 14:10:41,477 DEBUG Thread-LeaseMaintainer: The current p99 value is 10 seconds.
2019-09-19 14:10:41,477 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:10:41,539 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:10:41,539 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 5.628947 seconds.
2019-09-19 14:10:47,168 DEBUG Thread-LeaseMaintainer: The current p99 value is 10 seconds.
2019-09-19 14:10:47,168 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:10:47,238 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:10:47,238 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 1.232672 seconds.
2019-09-19 14:10:48,471 DEBUG Thread-LeaseMaintainer: The current p99 value is 10 seconds.
2019-09-19 14:10:48,471 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:10:48,584 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:10:48,584 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 0.579531 seconds.
2019-09-19 14:10:49,164 DEBUG Thread-LeaseMaintainer: The current p99 value is 10 seconds.
2019-09-19 14:10:49,164 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:10:49,242 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:10:49,242 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 5.177725 seconds.
2019-09-19 14:10:54,420 DEBUG Thread-LeaseMaintainer: The current p99 value is 10 seconds.
2019-09-19 14:10:54,420 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:10:54,477 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:10:54,477 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 6.448523 seconds.
2019-09-19 14:11:00,926 DEBUG Thread-LeaseMaintainer: The current p99 value is 10 seconds.
2019-09-19 14:11:00,926 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:11:01,002 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:11:01,002 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 7.749161 seconds.
2019-09-19 14:11:02,469 DEBUG Thread-Heartbeater: Sent heartbeat.
2019-09-19 14:11:08,752 DEBUG Thread-LeaseMaintainer: The current p99 value is 10 seconds.
2019-09-19 14:11:08,752 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:11:08,813 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:11:08,813 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 4.640038 seconds.
2019-09-19 14:11:13,454 DEBUG Thread-LeaseMaintainer: The current p99 value is 10 seconds.
2019-09-19 14:11:13,454 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:11:13,532 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:11:13,532 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 4.966822 seconds.
2019-09-19 14:11:17,747 INFO ThreadPoolExecutor-ThreadScheduler_0: [652240118977134] done sleeping
2019-09-19 14:11:17,748 DEBUG Thread-CallbackRequestDispatcher: Handling 1 batched requests
2019-09-19 14:11:17,818 DEBUG Thread-CallbackRequestDispatcher: Sent request(s) over unary RPC.
2019-09-19 14:11:17,818 DEBUG Thread-CallbackRequestDispatcher: Current load: 0.00
2019-09-19 14:11:17,818 DEBUG Thread-CallbackRequestDispatcher: Released held message to leaser, scheduling callback for it, still on hold 0.
2019-09-19 14:11:17,819 INFO ThreadPoolExecutor-ThreadScheduler_1: [652240118977135] got message with content: b'1'
2019-09-19 14:11:17,819 INFO ThreadPoolExecutor-ThreadScheduler_1: [652240118977135] sleeping
2019-09-19 14:11:17,819 DEBUG Thread-CallbackRequestDispatcher: Did not resume, current load is 1.00.
2019-09-19 14:11:18,499 DEBUG Thread-LeaseMaintainer: The current p99 value is 41 seconds.
2019-09-19 14:11:18,499 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:11:18,575 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:11:18,575 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 18.568330 seconds.
2019-09-19 14:11:32,469 DEBUG Thread-Heartbeater: Sent heartbeat.
2019-09-19 14:11:37,144 DEBUG Thread-LeaseMaintainer: The current p99 value is 41 seconds.
2019-09-19 14:11:37,144 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:11:37,470 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:11:37,470 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 5.602221 seconds.
2019-09-19 14:11:43,072 DEBUG Thread-LeaseMaintainer: The current p99 value is 41 seconds.
2019-09-19 14:11:43,072 DEBUG Thread-LeaseMaintainer: Renewing lease for 1 ack IDs.
2019-09-19 14:11:43,146 DEBUG Thread-LeaseMaintainer: Sent request(s) over unary RPC.
2019-09-19 14:11:43,146 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 22.866717 seconds.
2019-09-19 14:11:57,859 INFO ThreadPoolExecutor-ThreadScheduler_1: [652240118977135] done sleeping
2019-09-19 14:11:57,859 DEBUG Thread-CallbackRequestDispatcher: Handling 1 batched requests
2019-09-19 14:11:57,923 DEBUG Thread-CallbackRequestDispatcher: Sent request(s) over unary RPC.
2019-09-19 14:11:57,924 DEBUG Thread-CallbackRequestDispatcher: Current load: 0.00
2019-09-19 14:11:57,924 DEBUG Thread-CallbackRequestDispatcher: Current load is 0.00, resuming consumer.
2019-09-19 14:11:57,924 DEBUG Thread-ConsumeBidirectionalStream: woken.
2019-09-19 14:11:57,924 DEBUG Thread-ConsumeBidirectionalStream: waiting for recv.
2019-09-19 14:11:57,924 DEBUG Thread-ConsumeBidirectionalStream: recved response.
2019-09-19 14:11:57,924 DEBUG Thread-ConsumeBidirectionalStream: Processing 1 received message(s), currenty on hold 0.
2019-09-19 14:11:57,985 DEBUG Thread-ConsumeBidirectionalStream: Sent request(s) over unary RPC.
2019-09-19 14:11:57,986 DEBUG Thread-ConsumeBidirectionalStream: Message backlog over load at 1.00, pausing.
2019-09-19 14:11:57,986 DEBUG Thread-ConsumeBidirectionalStream: Scheduling callbacks for 1 new messages, new total on hold 0.
2019-09-19 14:11:57,986 ERROR ThreadPoolExecutor-ThreadScheduler_0: [652240118977135] Duplicate message!!!
2019-09-19 14:11:57,986 DEBUG Thread-ConsumeBidirectionalStream: paused, waiting for waking.
2019-09-19 14:11:57,986 DEBUG Thread-CallbackRequestDispatcher: Handling 1 batched requests
2019-09-19 14:11:58,087 DEBUG Thread-CallbackRequestDispatcher: Sent request(s) over unary RPC.
2019-09-19 14:11:58,088 DEBUG Thread-CallbackRequestDispatcher: Current load: 0.00
2019-09-19 14:11:58,088 DEBUG Thread-CallbackRequestDispatcher: Current load is 0.00, resuming consumer.
2019-09-19 14:11:58,088 DEBUG Thread-ConsumeBidirectionalStream: woken.
2019-09-19 14:11:58,088 DEBUG Thread-ConsumeBidirectionalStream: waiting for recv.
2019-09-19 14:12:02,470 DEBUG Thread-Heartbeater: Sent heartbeat.
2019-09-19 14:12:06,013 DEBUG Thread-LeaseMaintainer: The current p99 value is 81 seconds.
2019-09-19 14:12:06,013 DEBUG Thread-LeaseMaintainer: Snoozing lease management for 71.733369 seconds.
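
To make the timing in the debug log easier to read, the following sketch computes two intervals from three of the lines above: how long the second message sat on hold in the client before being released to the leaser, and how long after the initial receipt the duplicate arrived. The helper names parse_ts and seconds_between are hypothetical. Note that every "Renewing lease" line during the hold period covers only 1 ack ID, which may mean the held message was not being lease-managed during that window; the log alone does not confirm this.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], LOG_TIME_FORMAT)


def seconds_between(first, second):
    """Elapsed seconds between the timestamps of two log lines."""
    return (parse_ts(second) - parse_ts(first)).total_seconds()


# Lines copied verbatim from the debug log above.
received = "2019-09-19 14:10:37,634 DEBUG Thread-ConsumeBidirectionalStream: Processing 2 received message(s), currenty on hold 0."
released = "2019-09-19 14:11:17,818 DEBUG Thread-CallbackRequestDispatcher: Released held message to leaser, scheduling callback for it, still on hold 0."
duplicate = "2019-09-19 14:11:57,986 ERROR ThreadPoolExecutor-ThreadScheduler_0: [652240118977135] Duplicate message!!!"

held_seconds = seconds_between(received, released)
redelivery_seconds = seconds_between(received, duplicate)

print(f"second message on hold for {held_seconds:.3f} s")
print(f"duplicate seen {redelivery_seconds:.3f} s after initial receipt")
```

Both intervals are far below the 5-minute acknowledgment deadline configured on the subscription.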

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 20 (12 by maintainers)

Top GitHub Comments

plamut commented, Sep 20, 2019 (1 reaction)

After experimenting with various things, I was not able to pinpoint the exact source of the issue (in the client at least).

It appears that if the delay in the message handler is considerably shorter than 10 seconds, the issue is not reproducible. The probability of a duplicate message increases when the time.sleep() interval is around 10 seconds (8 to 12 seconds), and approaches one with delays longer than that.

I also tried disabling the automatic modifications of the ACK deadlines, just in case the related client’s logic is flawed:

diff --git a/pubsub/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py b/pubsub/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py
index 2b257482930..d6b1d2c54d0 100644
--- a/pubsub/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py
+++ b/pubsub/google/cloud/pubsub_v1/subscriber/_protocol/dispatcher.py
@@ -152,7 +156,10 @@ class Dispatcher(object):
         """
         ack_ids = [item.ack_id for item in items]
         seconds = [item.seconds for item in items]
-
+        #################
+        _LOGGER.debug(f"\x1b[33m(DISABLED) Modifying ACK deadline to {seconds}, ACK IDs: {ack_ids}\x1b[0m")
+        return
+        ######################
         request = types.StreamingPullRequest(
             modify_deadline_ack_ids=ack_ids, modify_deadline_seconds=seconds
         )

The result was the same, which makes me think it might be something with the subscription on the backend (the ACK deadline is set to 300 seconds), and that the deadline expected by the server is actually around 10 seconds, in my case at least.

I did some further testing, and there seems to be some dynamic timeout that changes depending on how long it takes to process a batch of messages. This timeout seems to ignore the ack deadline.

I got a similar impression. The actual server-side deadline does not seem to be static. If it were really, say, 10 seconds, a sleep delay of 8 seconds should almost never reproduce the issue, while a delay of 12 seconds should always reproduce it. However, the outcome of the test script was not deterministic.

Labeling as a backend issue until further evidence.

plamut commented, Jul 20, 2021 (0 reactions)

Sounds right, I’ll create a feature request issue in the new repo.
