
Many parallel subscriptions in one process => DEADLINE_EXCEEDED upon ack

See original GitHub issue

Environment details

  • OS: GKE
  • Node.js version: 10.13.0
  • @google-cloud/pubsub version: 0.28.1

Steps to reproduce

This is a follow-up report to #240. The symptoms are the same, but I think I have a better understanding of what might be going wrong this time.

I have made another attempt at finally upgrading the Node.js pubsub library from v0.18.0 to v0.28.1.

  • I am currently running a couple of pods that subscribe to several pubsub topics.
  • Most of the pods subscribe to only one or two topics; those work fine with the updated library.
  • I have one pod that subscribes to many subscriptions; this one shows problems, see below.

Here’s what the problematic pod does (a minimal sketch of this setup follows the list):

  • Starts up by initializing 23 different (pre-existing) subscriptions on 23 different topics.
  • Starts pulling the messages that sit in these subscriptions; these are low-frequency subscriptions, so we’re talking about 0-500 backlogged messages initially.
  • Processes those messages in batches and ack()s all of them; processing a batch takes ~500 ms.
  • After a few seconds (I’ve seen 5-12), I receive a DEADLINE_EXCEEDED error event on one of the subscriptions where acks have been sent (and I crash the process).

Here’s the error:

Error: Failed to "acknowledge" for 6 message(s). Reason: 4 DEADLINE_EXCEEDED: Deadline Exceeded
    at AckQueue.<anonymous> (/app/node_modules/@google-cloud/pubsub/build/src/message-queues.js:162:23)
    at Generator.throw (<anonymous>)
    at rejected (/app/node_modules/@google-cloud/pubsub/build/src/message-queues.js:20:65)

For the affected subscriptions, the num_undelivered_messages Stackdriver metric just keeps on rising:

[Screenshot 2019-03-13 22:25: num_undelivered_messages keeps rising for the affected subscription]

Stackdriver also indicates that there are indeed streaming_pull_message_operation_count operations (they keep rising because my pod keeps crashing and restarting, thus pulling the growing backlog of unacked messages over and over again):

[Screenshot 2019-03-13 22:26: streaming_pull_message_operation_count keeps rising]

However, Stackdriver doesn’t show any pull_ack_message_operation_count operations (ignore the spike on the right; that’s where I started the pod with only one subscription, where it worked):

[Screenshot 2019-03-13 22:30: no pull_ack_message_operation_count operations]

Just for comparison: with version v0.18.0 of this client library, which I had been using before, it was still using streamingAcks (see the discussion in #240) and everything was working well, even with the 23 parallel subscriptions in one Node.js process:

[Screenshot 2019-03-13 22:32: the same metrics with v0.18.0, working as expected]

All screenshots above show the same subscription and the same timeframe. At ~4:30pm I deployed the new version; that’s the point where the number of undelivered messages starts rising.

To me it looks like the number of parallel subscriptions determines whether acks work as expected or whether I consistently receive DEADLINE_EXCEEDED errors in response to the ack batches. Interestingly, acks either work for all subscriptions handled by the affected process or for none.

I doubt that it is CPU-related, because the process does essentially only I/O (piping the messages from Pub/Sub to BigQuery) and the GKE CPU graphs also show no significant CPU usage. CPU usage of the same process with v0.18.0 of the library was also <1%.

Is the number of parallel subscriptions that can be handled in a single node process (all subscription message handlers registered with a single new PubSub() client instance) somehow limited?

Any pointer in which direction I should be looking would be highly appreciated!

(The next thing I’d like to try is to see whether I can narrow down the number of parallel subscriptions in one process at which it starts to break… However, due to this npm registry incident I currently have trouble deploying new code to the GKE cluster: https://status.npmjs.org/incidents/3p30y0pkblsp)

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 24 (18 by maintainers)

Top GitHub Comments

2 reactions
ctavan commented, May 22, 2019

@sduskis totally fine with a safe limit of 50. Just use whatever error message is more descriptive than DEADLINE_EXCEEDED and throws immediately, not only after a while when the first acks start to fail… 😉
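For illustration only, here is a hypothetical sketch of the kind of fail-fast check being suggested (this is not actual library code; the limit constant and function name are made up):

const MAX_STREAMS_PER_CLIENT = 50; // the "safe limit" discussed above

// Hypothetical guard: refuse to open another subscription on a client that
// would exceed the stream budget, instead of letting acks time out later
// with an opaque DEADLINE_EXCEEDED.
function assertStreamBudget(openStreams, streamsForNewSubscription) {
  if (openStreams + streamsForNewSubscription > MAX_STREAMS_PER_CLIENT) {
    throw new Error(
      `Too many concurrent streaming pulls on one PubSub client ` +
        `(${openStreams} already open, limit ${MAX_STREAMS_PER_CLIENT}). ` +
        `Create additional PubSub clients or reduce maxStreams.`
    );
  }
}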

I furthermore suggest moving option 4 into a separate issue and keeping this one open only until the error message has been added.

2 reactions
ctavan commented, Apr 8, 2019

@callmehiphop @bcoe I’ve extended my reproduction case a bit to show that:

  1. The limit is indeed per new PubSub() instance. If I create a new PubSub() instance for each subscription, I can subscribe to 20 subscriptions (or more) with maxStreams: 5 (the default), equaling >= 100 concurrent gRPC streams (see the sketch after this list).
  2. Publishing on the same new PubSub() instance doesn’t seem to have any effect on the observed subscription behavior. Regardless of whether I previously published on the same new PubSub() instance, I can spawn exactly 99 concurrent gRPC streams, and as soon as I spawn the 100th I get DEADLINE_EXCEEDED errors.
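A minimal sketch of that per-subscription-client workaround, assuming placeholder subscription names (illustrative only, not the actual reproduction case):

const {PubSub} = require('@google-cloud/pubsub');

// One dedicated client (and therefore its own gRPC channel) per subscription,
// so no single client accumulates ~100 concurrent streaming pulls
// (each subscription opens maxStreams = 5 streams by default).
function subscribeWithOwnClient(subscriptionName) {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(subscriptionName);

  subscription.on('message', message => message.ack());
  subscription.on('error', err => console.error(subscriptionName, err));

  return subscription;
}

// With 20+ subscriptions on a single shared client this exceeds 100 streams;
// with a client per subscription it stays at 5 streams per client.
['subscription-1', 'subscription-2' /* ... */].forEach(subscribeWithOwnClient);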