Many parallel subscriptions in one process => DEADLINE_EXCEEDED upon ack
Environment details
- OS: GKE
- Node.js version: 10.13.0
- `@google-cloud/pubsub` version: 0.28.1
Steps to reproduce
This is a follow-up report to #240. The symptoms are the same, but I think I have a better understanding of what might be going wrong this time.
I have made another attempt at finally upgrading the Node Pub/Sub library from `v0.18.0` to `v0.28.1`.
- I am currently running a couple of pods that subscribe to several Pub/Sub topics.
- Most of the pods subscribe to only one or two topics; those work fine with the updated library.
- I have one pod that subscribes to many subscriptions; this one shows problems, see below.
Here’s what the problematic pod does (a minimal sketch follows the list):
- Starts up by initializing 23 different (pre-existing) subscriptions on 23 different topics.
- Starts pulling the messages that sit in these subscriptions. These are low-frequency subscriptions, so we’re talking about 0-500 messages sitting in the subscriptions initially.
- Processes those messages in batches and `ack()`s all of them; processing a batch takes ~500ms.
- After a few seconds (I’ve seen 5-12), I receive a `DEADLINE_EXCEEDED` error event on one of the subscriptions where acks have been sent (and I crash the process).
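For reference, here is a minimal sketch of this setup; the subscription names and the handler body are hypothetical stand-ins for the real batching/BigQuery pipeline:

```js
// Minimal sketch of the problematic setup: 23 subscriptions served by a
// single PubSub client. Names and handler logic are illustrative only.
const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub(); // one shared client instance for everything

const names = Array.from({length: 23}, (_, i) => `low-frequency-sub-${i}`);

for (const name of names) {
  const subscription = pubsub.subscription(name);

  subscription.on('message', message => {
    // Real pod: collect into batches, pipe to BigQuery (~500ms/batch), then ack.
    message.ack();
  });

  subscription.on('error', err => {
    // The DEADLINE_EXCEEDED ack failures surface here a few seconds after startup.
    console.error(`subscription ${name}:`, err);
    process.exit(1); // crash; Kubernetes restarts the pod
  });
}
```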
Here’s the error:
```
Error: Failed to "acknowledge" for 6 message(s). Reason: 4 DEADLINE_EXCEEDED: Deadline Exceeded
    at AckQueue.<anonymous> (/app/node_modules/@google-cloud/pubsub/build/src/message-queues.js:162:23)
    at Generator.throw (<anonymous>)
    at rejected (/app/node_modules/@google-cloud/pubsub/build/src/message-queues.js:20:65)
```
For the affected subscriptions, the `num_undelivered_messages` Stackdriver metric just keeps on rising:
Stackdriver also indicates that there are indeed `streaming_pull_message_operation_count` operations (they keep rising because my pod keeps crashing and restarting, thus pulling the growing backlog of unacked messages over and over again):
However, Stackdriver doesn’t show any `pull_ack_message_operation_count` operations (ignore the spike to the right; that is where I started the pod with only one subscription, where it worked):
Just for comparison: version `v0.18.0` of this client library, which I had been using before, was still using `streamingAcks` (see the discussion in #240), and everything was working well, even with the 23 parallel subscriptions in one Node process:
All screenshots above show the same subscription and the same timeframe; at ~4:30pm I deployed the new version, which is the point where the number of undelivered messages starts rising.
To me it looks like the number of parallel subscriptions determines whether acks work as expected or whether I consistently receive `DEADLINE_EXCEEDED` errors in response to the ack batches. Interestingly, acks either work for all subscriptions handled by the affected process or for none.
I doubt that it is CPU related, because the process does basically only I/O (it pipes the messages from Pub/Sub to BigQuery), and the GKE CPU graphs also show no significant CPU usage. CPU usage of the same process with `v0.18.0` of the library was also <1%.
Is the number of parallel subscriptions that can be handled in a single Node process (all subscription message handlers registered with a single `new PubSub()` client instance) somehow limited?
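If I do the math (and this is only a guess, assuming each subscription opens `maxStreams` streaming-pull grpc streams and that grpc caps concurrent streams per channel at 100 by default), the numbers would line up:

```js
// Back-of-the-envelope guess, not a confirmed mechanism: each subscription
// opens `maxStreams` streaming-pull grpc streams (5 by default), and grpc
// allows at most 100 concurrent streams per channel by default.
const subscriptions = 23;
const maxStreams = 5; // library default
console.log(subscriptions * maxStreams); // 115 — over a 100-stream cap
```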
Any pointer as to which direction I should be looking in would be highly appreciated!
(The next thing I’d like to try is to narrow down the number of parallel subscriptions in one process at which it starts to break… However, due to this npm registry issue, I currently have trouble deploying new code to the GKE cluster: https://status.npmjs.org/incidents/3p30y0pkblsp)
Top GitHub Comments
@sduskis totally fine with a safe limit of 50. Just make the error message more descriptive than `DEADLINE_EXCEEDED`, and have it throw immediately instead of only after a while, when the first acks start to fail… 😉 I furthermore suggest moving option 4 into a separate issue and keeping this one open only until the error message has been added.
@callmehiphop @bcoe I’ve extended my reproduction case a bit to show that:
- The limit applies per `new PubSub()` instance. If I create a `new PubSub()` instance for each subscription, I can subscribe to 20 subscriptions (or more) with `maxStreams: 5` (the default), equaling >= 100 concurrent grpc streams.
- Publishing on the same `new PubSub()` instance doesn’t seem to have any effect on the observed subscription behavior. No matter whether or not I published on the same `new PubSub()` instance before, I can spawn exactly 99 concurrent grpc streams, and as soon as I spawn the 100th I get `DEADLINE_EXCEEDED` errors.
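To illustrate the client-per-subscription pattern from the comment above, here is a minimal sketch (subscription names are illustrative, not from the original reproduction case):

```js
// Sketch of the workaround the comment above describes: one PubSub client
// per subscription keeps each client's grpc channel under the
// concurrent-stream cap observed above.
const {PubSub} = require('@google-cloud/pubsub');

const names = Array.from({length: 20}, (_, i) => `sub-${i}`);

for (const name of names) {
  const client = new PubSub(); // separate client per subscription
  const subscription = client.subscription(name);

  subscription.on('message', message => {
    // process, then ack
    message.ack();
  });
  subscription.on('error', err => console.error(name, err));
}
```

The trade-off is one client (and its connection overhead) per subscription, but per the observation above it avoids concentrating all streaming-pull streams on a single client.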