Stuck consumer: terminate called after throwing an instance of 'std::bad_alloc'
Describe the bug
A consumer gets stuck after seeing an error, most probably coming from the cgo layer underneath:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
To Reproduce
Steps to reproduce the behavior:
- Shell into a Kubernetes pod and reset the cursor for a given subscription, like:
bin/pulsar-admin topics reset-cursor persistent://public/default/SpaceEvents -s cloud-notifications-service -t 999w
- The consumer is able to read a few messages and then eventually fails with the above error. It doesn’t seem to be trying anything (e.g. reconnection, termination…); it just gets stuck
- If we terminate the service manually, it resumes consuming, and after a while it eventually gets stuck again
Expected behavior
I would expect it not to block and not to raise any bad alloc error.
Screenshots
No screenshots available.
Desktop (please complete the following information):
- OS: Kubernetes on GCP
Additional context
I cross-referenced the logs of our consumer to see what happens on the Pulsar side when we get the bad alloc errors, and we were able to find some interesting exceptions that seem to happen concomitantly with the bad alloc errors (see attached report).
Some errors are particularly interesting and make me think that we might have issues when reading entries from a ledger (bookkeeper). Is there anything you can suggest on how to better debug this? Thanks!
Top GitHub Comments
After further investigation I was able to come across two scenarios, which may be related.
Silent broker scenario
- The stats tool reports several messages in the msgBacklog for the given topic/subscription, and it shows all the consumers as connected but with a msgRateOut of 0
- The consumers get through the flow stage and then hang forever waiting for a MESSAGE command that never comes, meaning they connect to the broker, send the 1000 permits and then just get pings from there, no messages whatsoever

This is the Python code I used:
And the requirements.txt:

Possible acking deadlock (Golang client)
- (trace-level logs enabled)
- The stats tool reports no consumers at all despite all clients printing the PING/PONG successfully in their logs
- The stats tool reports a msgBacklog greater than zero, so there are messages waiting to be processed
- I print Trying to ack on the logs whenever an ack was taking more than 150ms

The last bullet point means that I changed the ack call, wrapping it with timing and logging code, like the sketch shown below.
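Roughly, the wrapper looks like this (simplified sketch, not the actual diff; ackWithLog is just an illustrative name, and it assumes the standard apache/pulsar-client-go Consumer and Message interfaces):

```go
package ackdebug

import (
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

// ackWithLog acks msg and prints a "Trying to ack" line every 150ms for
// as long as the underlying Ack call has not returned, so slow or stuck
// acks become visible in the consumer logs.
func ackWithLog(consumer pulsar.Consumer, msg pulsar.Message) {
	done := make(chan struct{})
	go func() {
		consumer.Ack(msg) // the original, unwrapped ack call
		close(done)
	}()
	for {
		select {
		case <-done:
			return // ack completed
		case <-time.After(150 * time.Millisecond):
			log.Printf("Trying to ack %v", msg.ID())
		}
	}
}
```

With a wrapper of this shape, an ack that never completes keeps printing the same message ID forever, which matches the behaviour described next.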
If the ack happens within 150ms we don’t see any logs. The problem is that now I have all consumers stuck in a never ending loop just printing:
This has been going on for several hours and the message ID of the log entry is always the same until I kill the consumer.
Could it be that, given that the eventsCh is a buffered channel of 3, when a connection gets closed due to a message frame size that is too big, runEventsLoop() never gets to process the *connectionClosed event due to at least 3 in-flight ack requests?

Meaning: we could have 3 ack requests that are already keeping the channel full, the connection gets closed, the acks can't be processed because the connection was closed, and we cannot reconnect to the broker because, due to the channel being full, we can't push the *connectionClosed event into the eventsCh, thus it never gets processed = deadlock?

In support of this theory I can see this in the logs right before I get the never ending Trying to ack loop:

Also, by analyzing the stack dump, I can see that one of the 21 goroutines running is stuck waiting here.
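As a toy illustration of the suspected mechanism (a simplified model, not the actual pulsar-client-go internals; the real client would block on the send instead of dropping the event):

```go
package main

import "fmt"

// Toy model of the hypothesis: the events channel is full of ack
// requests, nothing is draining it, so the close notification that
// would trigger a reconnect can never get in.
type ackRequest struct{ msgID int }
type connectionClosed struct{}

func main() {
	eventsCh := make(chan interface{}, 3)

	// Three in-flight ack requests fill the buffer.
	for i := 0; i < 3; i++ {
		eventsCh <- &ackRequest{msgID: i}
	}

	// In the real client the events loop would be stuck because the
	// connection is gone, so nothing reads from eventsCh. A blocking send
	// here would hang forever; the non-blocking send just makes the
	// outcome printable.
	select {
	case eventsCh <- &connectionClosed{}:
		fmt.Println("connectionClosed enqueued: reconnection can proceed")
	default:
		fmt.Println("eventsCh full: *connectionClosed never processed, no reconnect, acks stay stuck")
	}
}
```

Running this prints the second branch: with the buffer already full of acks and nothing draining it, the close notification can never be enqueued, so no reconnection is ever triggered.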
Killing the consumer in this case helps but it eventually gets stuck again trying to ack some other message.
We could potentially try to look into the second scenario but have no clue whatsoever about the first one and we have no Java expertise. Can someone please look a bit more into this and tell us whether you need more information? Thanks.
@codelipenghui @gaoran10 We created a clone of the whole production environment with the offloaded buckets too and, even with 2.6.1, we’re perfectly able to replicate. In fact the consumers are still getting stuck. We’re able to unlock them only by manually killing the pod on Kubernetes.
I deployed 3 consumers with a key shared subscription listening to our SpaceEvents topic. Once all three are running, I reset the consumers' offset for that subscription by doing bin/pulsar-admin topics reset-cursor SpaceEvents -s cloud-pulsar-tester -t 99999w. After the reset I can see that the 3 consumers started reading messages. The msgRateOut shown in the bin/pulsar-admin topics stats SpaceEvents output also shows that all 3 are reading messages. At some point, when there are around 181k messages left in the backlog, the consumers get stuck.
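The test consumers boil down to a minimal receive/ack loop along these lines (illustrative sketch only, assuming the native apache/pulsar-client-go client; the broker URL is a placeholder and the real services do their own message handling):

```go
package main

import (
	"context"
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	// Placeholder broker URL; the real deployment connects through the Pulsar proxy.
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Key_Shared subscription on the SpaceEvents topic, as in the test described above.
	consumer, err := client.Subscribe(pulsar.ConsumerOptions{
		Topic:            "persistent://public/default/SpaceEvents",
		SubscriptionName: "cloud-pulsar-tester",
		Type:             pulsar.KeyShared,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	for {
		msg, err := consumer.Receive(context.Background())
		if err != nil {
			log.Printf("receive failed: %v", err)
			continue
		}
		// ... message handling ...
		consumer.Ack(msg)
	}
}
```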
Golang µservices logs report warnings only, coming from the Pulsar library itself (which uses logrus), not from our code:
Pulsar logs report the following (by grepping by exception).

Broker

Proxy
When inspecting the subscription via the pulsar-admin CLI tool, we see no consumers at all despite having 3 services running (and using your Pulsar Golang library - the native one, no cgo):

$ bin/pulsar-admin topics stats SpaceEvents | jq '.subscriptions["cloud-pulsar-tester"]' -c | jq
If we inspect the Golang µservices logs with a lower debug level we see logs coming from the underlying Pulsar Golang library as shown below (please note that the following logs are regularly printed also after the consumers get stuck):