Publishing messages burst timeout
Environment details
- OS: Google Kubernetes Container
- Node.js version: 12.15.0
- npm version: -
- @google-cloud/pubsub version: 2.18.3
Steps to reproduce
- ?
- ?
We’re seeing an issue in our production environment. It happens pretty inconsistently, so I’m not sure exactly how to reproduce it.
This service consistently publishes messages to a couple of topics, at a volume of around 1 MiB per second. The errors come in bursts rather than steadily, and each burst comes from a single pod at a time (we run about 150 pods for this service). For example, we’ll see a burst of ~5k errors across all of the topics coming from pod A, and the next day we’ll see the same from pod B. This recurs every several hours or days. Rolling out the deployment or killing the offending pod resolves the errors for at least a few hours. The errors don’t resolve on their own quickly, or at least not within 20 minutes.
BTW, the pubsub instance is created once and reused for subsequent publishes.
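For context, the publish path looks roughly like the sketch below. This is a minimal illustration rather than the actual service code: the topic name and payload handling are placeholders, and it assumes the standard @google-cloud/pubsub 2.x Topic#publish() API with a single client created at startup.

```js
const {PubSub} = require('@google-cloud/pubsub');

// Single client and topic object created once at startup and reused for every publish.
const pubsub = new PubSub();
const topic = pubsub.topic('my-topic'); // illustrative topic name

async function publishEvent(payload) {
  const dataBuffer = Buffer.from(JSON.stringify(payload));
  // publish() resolves with the server-assigned message ID once the batch is sent.
  const messageId = await topic.publish(dataBuffer);
  return messageId;
}
```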
The error message and stack:
Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received.
at repeat (/deploy/my-project/node_modules/google-gax/build/src/normalCalls/retries.js:66:31)
at Timeout._onTimeout (/deploy/my-project/node_modules/google-gax/build/src/normalCalls/retries.js:101:25)
at listOnTimeout (internal/timers.js:531:17)
at processTimers (internal/timers.js:475:7)
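The 600000 ms in that message is the total retry timeout that google-gax applies to the Publisher's publish calls. As a rough sketch of where that knob lives, retry/backoff settings can be supplied explicitly when using the low-level v1.PublisherClient; the retry codes and timeout values below are illustrative, not settings taken from this report.

```js
const {v1} = require('@google-cloud/pubsub');

const publisherClient = new v1.PublisherClient();

async function publishWithRetrySettings(projectId, topicName, data) {
  const formattedTopic = publisherClient.projectTopicPath(projectId, topicName);
  const request = {
    topic: formattedTopic,
    messages: [{data: Buffer.from(data)}],
  };

  // backoffSettings govern per-RPC and overall timeouts; totalTimeoutMillis is
  // the 600000 ms budget the error message above refers to.
  const retrySettings = {
    retryCodes: [
      4,  // DEADLINE_EXCEEDED
      10, // ABORTED
      14, // UNAVAILABLE
    ],
    backoffSettings: {
      initialRetryDelayMillis: 100,
      retryDelayMultiplier: 1.3,
      maxRetryDelayMillis: 60000,
      initialRpcTimeoutMillis: 5000,
      rpcTimeoutMultiplier: 1.0,
      maxRpcTimeoutMillis: 600000,
      totalTimeoutMillis: 600000,
    },
  };

  const [response] = await publisherClient.publish(request, {retry: retrySettings});
  return response.messageIds;
}
```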
Thanks! Please let me know what other information would be helpful.
Top GitHub Comments
This looks like a client-side issue. All pods but one are able to send requests to Pub/Sub and get a response back. Removing the bad pod is a good temporary fix, but to fix this for good we need to know what the bad pod is doing with those failed requests. Was it unable to send the requests in the first place, or was it unable to receive the responses? Chances are the client has exhausted its network connections, or the connection to the Pub/Sub endpoint was dropped but the client is unaware of it and is still using the broken connection without re-establishing a new one. How resilient is your pod in handling error conditions? Can you share a code snippet showing how messages are being published in the pod, and how it handles error conditions?
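For reference, a publish call with explicit error handling might look something like the following. This is an illustrative sketch rather than the reporter's actual code: the topic name, payload, and logging are assumptions, and it simply surfaces failures so bursts of the gax total-timeout error above show up in logs and metrics instead of being swallowed.

```js
const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const topic = pubsub.topic('my-topic'); // illustrative topic name

async function publishWithErrorHandling(payload) {
  const dataBuffer = Buffer.from(JSON.stringify(payload));
  try {
    const messageId = await topic.publish(dataBuffer);
    return messageId;
  } catch (err) {
    // Log and rethrow so the caller can decide whether to retry, drop, or alert.
    console.error('Publish failed:', err.message);
    throw err;
  }
}
```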
@feywind Where can we view that linked issue?