
Stuck consumer: terminate called after throwing an instance of 'std::bad_alloc'

See original GitHub issue

Describe the bug A consumer gets stuck after seeing an error, most probably coming from the cgo layer underneath:

terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

To Reproduce Steps to reproduce the behavior:

  1. Shell into Kubernetes pod and reset the cursor for a given subscription like bin/pulsar-admin topics reset-cursor persistent://public/default/SpaceEvents -s cloud-notifications-service -t 999w
  2. The consumer reads a few messages and then eventually fails with the above error. It doesn’t seem to try anything (e.g. reconnection, termination…), it just gets stuck (see the consumer sketch right after this list)
  3. If we terminate the service manually, it resumes consuming and then, after a while, it eventually gets stuck again
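
For context, a minimal consumer along these lines is enough to exercise the same path (a sketch using pulsar-client-go, not our exact service code; the Key_Shared subscription type is an assumption carried over from the tests further down):

package main

import (
	"context"
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{
		URL: "pulsar://localhost:6650", // in the cluster this points at the Pulsar proxy
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	consumer, err := client.Subscribe(pulsar.ConsumerOptions{
		Topic:            "persistent://public/default/SpaceEvents",
		SubscriptionName: "cloud-notifications-service",
		Type:             pulsar.KeyShared, // assumption: same subscription type as the later tests
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	for {
		// After the reset-cursor above, this reads a few messages and then, once the
		// bug triggers, blocks here indefinitely.
		msg, err := consumer.Receive(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("received message id=%v", msg.ID())
		consumer.Ack(msg)
	}
}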

Expected behavior I would expect it to not block and to not raise any bad alloc error.

Screenshots No screenshots available.

Desktop (please complete the following information):

  • OS: Kubernetes on GCP

Additional context I cross-referenced the logs of our consumer with what happens on the Pulsar side when we get the bad_alloc errors, and we were able to find some interesting exceptions that seem to happen concomitantly with them (see attached report).

Some errors are particularly interesting and make me think that we might have issues when reading entries from a ledger (bookkeeper). Is there anything you can suggest on how to better debug this? Thanks!

report.txt

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 37 (18 by maintainers)

Top GitHub Comments

2 reactions
fracasula commented, Aug 31, 2020

After further investigation I was able to come across two scenarios, which may be related.

Silent broker scenario

  • all consumers are connected with PING and PONG responses travelling over the wire
  • the stats tool reports several messages in the msgBacklog for the given topic/subscription and it shows all the consumers as connected but with a msgRateOut of 0
  • by looking at the Pulsar binary protocol I can say that everything goes smoothly for all consumers, even after restarts, up to the flow stage; they then hang forever waiting for a MESSAGE command that never comes, meaning they connect to the broker, send the 1000 permits and from then on only get pings, no messages whatsoever (a Go-side check for this is sketched at the end of this scenario)
  • neither killing the Pulsar proxy nor restarting the consumers helps
  • restarting the brokers fixes the issue
  • when trying with the latest official Python client (2.6.1), it just dies as shown below
2020-08-31 08:37:07.728 INFO  [139696078108480] Client:88 | Subscribing on Topic :SpaceEvents
2020-08-31 08:37:07.728 INFO  [139696078108480] ConnectionPool:85 | Created connection for pulsar://localhost:6650
2020-08-31 08:37:07.729 INFO  [139696048822016] ClientConnection:343 | [[::1]:57112 -> [::1]:6650] Connected to broker
2020-08-31 08:37:08.286 INFO  [139696048822016] HandlerBase:53 | [persistent://public/default/SpaceEvents, cloud-pulsar-tester, 0] Getting connection from pool
2020-08-31 08:37:08.426 INFO  [139696048822016] ConnectionPool:85 | Created connection for pulsar://10.56.3.23:6650
2020-08-31 08:37:08.427 INFO  [139696048822016] ClientConnection:345 | [[::1]:57114 -> [::1]:6650] Connected to broker through proxy. Logical broker: pulsar://10.56.3.23:6650
2020-08-31 08:37:08.968 WARN  [139696048822016] ClientConnection:947 | [[::1]:57114 -> [::1]:6650] Received error response from server: UnknownError -- req_id: 0
2020-08-31 08:37:08.968 ERROR [139696048822016] ConsumerImpl:242 | [persistent://public/default/SpaceEvents, cloud-pulsar-tester, 0] Failed to create consumer: UnknownError
Traceback (most recent call last):
  File "main.py", line 4, in <module>
    consumer = client.subscribe('SpaceEvents', 'cloud-pulsar-tester')
  File "/home/francesco/.local/lib/python3.8/site-packages/pulsar/__init__.py", line 655, in subscribe
    c._consumer = self._client.subscribe(topic, subscription_name, conf)
Exception: Pulsar error: UnknownError
2020-08-31 08:37:08.974 INFO  [139696078108480] ClientConnection:1387 | [[::1]:57114 -> [::1]:6650] Connection closed
2020-08-31 08:37:08.974 INFO  [139696078108480] ClientConnection:1387 | [[::1]:57112 -> [::1]:6650] Connection closed
2020-08-31 08:37:08.974 INFO  [139696078108480] ClientConnection:238 | [[::1]:57114 -> [::1]:6650] Destroyed connection
2020-08-31 08:37:08.974 INFO  [139696078108480] ClientConnection:238 | [[::1]:57112 -> [::1]:6650] Destroyed connection

This is the Python code I used:

import pulsar

client = pulsar.Client('pulsar://localhost:6650')
consumer = client.subscribe('SpaceEvents', 'cloud-pulsar-tester')


def consume():
    while True:
        msg = consumer.receive()
        try:
            print("Received message '{}' id='{}'".format(msg.data(), msg.message_id()))
            # Acknowledge successful processing of the message
            consumer.acknowledge(msg)
        except Exception:
            # Message failed to be processed
            consumer.negative_acknowledge(msg)


if __name__ == '__main__':
    consume()
    client.close()

And the requirements.txt:

pulsar-client==2.6.1
apache-bookkeeper-client==4.11.0
grpcio<1.26.0
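
Since the Python client dies outright while our Go consumers just hang, a Go-side check along these lines makes the silent-broker symptom visible as a timeout instead of an indefinite block (a sketch; silentBrokerCheck is a hypothetical helper, not part of our services):

package pulsartest

import (
	"context"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

// silentBrokerCheck tries to receive a single message with a deadline, so a
// consumer that is connected but never dispatched to shows up as a timeout.
func silentBrokerCheck(consumer pulsar.Consumer, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	msg, err := consumer.Receive(ctx)
	if err != nil {
		// With msgBacklog > 0 and msgRateOut stuck at 0 in the topic stats, hitting
		// this branch repeatedly means the broker is not dispatching despite the
		// permits the consumer has sent.
		log.Printf("no message within %s: %v", timeout, err)
		return false
	}
	log.Printf("received message id=%v", msg.ID())
	consumer.Ack(msg)
	return true
}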

Possible acking deadlock (Golang client)

  • all consumers report PING and PONG responses in their logs (verbosity set to trace)
  • the stats tool reports no consumers at all, despite all clients successfully printing the PING/PONG exchange in their logs
  • the stats tool reports a msgBacklog greater than zero so there are messages waiting to be processed
  • by getting a full goroutine stack dump I was able to determine that all consumers are stuck here
  • to be 100% sure that it wasn’t just a case of the service happening to be mid-ack every time I took the full goroutine stack dump, I modified the Golang client by adding a ticker that prints Trying to ack in the logs whenever an ack takes more than 150ms

The last bullet point means that I changed this code:

func (pc *partitionConsumer) runEventsLoop() {
	defer func() {
		pc.log.Debug("exiting events loop")
	}()
	for {
		select {
		case <-pc.closeCh:
			return
		case i := <-pc.eventsCh:
			switch v := i.(type) {
			case *ackRequest:
				pc.internalAck(v)

Like this:

func (pc *partitionConsumer) runEventsLoop() {
	defer func() {
		pc.log.Debug("exiting events loop")
	}()
	for {
		select {
		case <-pc.closeCh:
			return
		case i := <-pc.eventsCh:
			switch v := i.(type) {
			case *ackRequest:
				ctx, cancel := context.WithCancel(context.Background())
				go func(v *ackRequest) {
					for {
						select {
						case <-ctx.Done():
							return
						default:
							pc.log.Infof("Trying to ack %+v (%d - %d)", v.msgID, len(pc.eventsCh), cap(pc.eventsCh))
							time.Sleep(150 * time.Millisecond)
						}
					}
				}(v)

				pc.internalAck(v)
				cancel()

If the ack happens within 150ms we don’t see any logs. The problem is that now I have all consumers stuck in a never-ending loop, just printing:

{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T18:52:11+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T18:52:11+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T18:52:11+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T18:52:12+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T18:52:12+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T18:52:12+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T18:52:12+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T18:52:12+02:00","topic":"persistent://public/default/SpaceEvents"}

This has been going on for several hours and the message ID of the log entry is always the same until I kill the consumer.

Could it be that, given that eventsCh is a buffered channel with capacity 3, when a connection gets closed due to a message frame size that is too big, runEventsLoop() never gets to process the *connectionClosed event because of at least 3 in-flight ack requests?

Meaning: we could have 3 ack requests already keeping the channel full; the connection gets closed; the acks can’t be processed because the connection was closed; and we cannot reconnect to the broker because, with the channel full, we can’t push the *connectionClosed event into eventsCh, so it never gets processed: deadlock? (A minimal sketch of this interaction follows.)
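
To make the suspected interaction concrete, here is a minimal, self-contained sketch of that pattern (assumptions only, not the actual pulsar-client-go code): a small buffered channel kept full by ack requests, an event loop stalled on a dead connection, and a connectionClosed notification that can never be enqueued.

package main

import (
	"fmt"
	"time"
)

type event interface{}

type ackRequest struct{ id int }

type connectionClosed struct{}

func main() {
	eventsCh := make(chan event, 3) // capacity 3, like the client's buffered eventsCh

	// Application side: keeps acking messages; every ack is pushed onto eventsCh.
	go func() {
		for i := 0; ; i++ {
			eventsCh <- &ackRequest{id: i} // blocks once the buffer is full
		}
	}()

	// Events loop: the first ack stalls forever, standing in for an ack that can
	// no longer complete because the underlying connection is gone.
	go func() {
		for e := range eventsCh {
			switch e.(type) {
			case *ackRequest:
				fmt.Println("processing ack on a dead connection; stalling")
				select {} // never returns, so nothing else is ever drained
			case *connectionClosed:
				fmt.Println("would trigger a reconnect here") // never reached
			}
		}
	}()

	time.Sleep(500 * time.Millisecond) // let the pending ack requests fill the buffer

	// Connection teardown: tries to notify the loop, but the buffer is full and the
	// loop is stuck, so the send never succeeds -- the hypothesized deadlock.
	select {
	case eventsCh <- &connectionClosed{}:
		fmt.Println("connectionClosed delivered (no deadlock)")
	case <-time.After(2 * time.Second):
		fmt.Println("connectionClosed could not be delivered: eventsCh full, events loop blocked")
	}
}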

In support of this theory, I can see this in the logs right before I get the never-ending Trying to ack loop:

{"error":"write tcp [::1]:35530-\u003e[::1]:6650: use of closed network connection","level":"warn","local_addr":{"IP":"::1","Port":35530,"Zone":""},"message":"Failed to write on connection","remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"localhost:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":""},"severity":"WARNING","time":"2020-08-31T17:47:23+02:00"}
{"level":"debug","local_addr":{"IP":"::1","Port":35530,"Zone":""},"message":"Write data: 25","remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"localhost:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":""},"severity":"DEBUG","time":"2020-08-31T17:47:23+02:00"}
{"error":"write tcp [::1]:35530-\u003e[::1]:6650: use of closed network connection","level":"warn","local_addr":{"IP":"::1","Port":35530,"Zone":""},"message":"Failed to write on connection","remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"localhost:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":""},"severity":"WARNING","time":"2020-08-31T17:47:23+02:00"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T17:47:23+02:00","topic":"persistent://public/default/SpaceEvents"}
{"level":"info","local_addr":{"IP":"::1","Port":35530,"Zone":""},"message":"Connection closed","remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"localhost:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":""},"severity":"INFO","time":"2020-08-31T17:47:23+02:00"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T17:47:23+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T17:47:24+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T17:47:24+02:00","topic":"persistent://public/default/SpaceEvents"}
{"consumerID":1,"level":"info","message":"Trying to ack {messageID:{ledgerID:122209 entryID:146 batchIdx:0 partitionIdx:0} tracker:\u003cnil\u003e consumer:0xc000336000 receivedTime:{wall:13820250669831506541 ext:18161550576 loc:0x14dc660}} (3 - 3)","name":"nebuchadnezzar","severity":"INFO","subscription":"cloud-pulsar-tester","time":"2020-08-31T17:47:24+02:00","topic":"persistent://public/default/SpaceEvents"}

Also by analyzing the stack dump I can see that one of the 21 goroutines running is stuck waiting here.

1 @ 0x449d3b 0x41229f 0x412095 0xc5b3f5 0xc46394 0xc4ceb9 0xc41f5e 0xc4ccc6 0x47d571
#	0xc5b3f4	github.com/apache/pulsar-client-go/pulsar.(*partitionConsumer).ConnectionClosed+0x64	/home/francesco/Code/netdata/cloud-pulsar-tester/pulsar-client-go/pulsar/consumer_partition.go:567
#	0xc46393	github.com/apache/pulsar-client-go/pulsar/internal.(*connection).Close+0x583		/home/francesco/Code/netdata/cloud-pulsar-tester/pulsar-client-go/pulsar/internal/connection.go:751
#	0xc4ceb8	github.com/apache/pulsar-client-go/pulsar/internal.(*connection).run.func1+0x168	/home/francesco/Code/netdata/cloud-pulsar-tester/pulsar-client-go/pulsar/internal/connection.go:363
#	0xc41f5d	github.com/apache/pulsar-client-go/pulsar/internal.(*connection).run+0x2bd		/home/francesco/Code/netdata/cloud-pulsar-tester/pulsar-client-go/pulsar/internal/connection.go:369
#	0xc4ccc5	github.com/apache/pulsar-client-go/pulsar/internal.(*connection).start.func1+0x85	/home/francesco/Code/netdata/cloud-pulsar-tester/pulsar-client-go/pulsar/internal/connection.go:231

Killing the consumer in this case helps but it eventually gets stuck again trying to ack some other message.
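
For anyone trying to reproduce this: a full goroutine dump like the one above can be captured via the standard net/http/pprof handler (a sketch; not necessarily how this particular dump was taken):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// In the real service this listener would run in a goroutine alongside the
	// consumer; then, while the consumer is stuck:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
	log.Println(http.ListenAndServe("localhost:6060", nil))
}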

We could potentially look into the second scenario, but we have no clue whatsoever about the first one and we have no Java expertise. Can someone please look a bit more into this and tell us whether you need more information? Thanks.

2 reactions
fracasula commented, Aug 27, 2020

@codelipenghui @gaoran10 We created a clone of the whole production environment, including the offloaded buckets, and even with 2.6.1 we can replicate the issue reliably: the consumers are still getting stuck. We’re able to unblock them only by manually killing the pod on Kubernetes.

I deployed 3 consumers with a Key_Shared subscription listening to our SpaceEvents topic. Once all three are running, I reset the cursor for that subscription by running bin/pulsar-admin topics reset-cursor SpaceEvents -s cloud-pulsar-tester -t 99999w. After the reset I can see that the 3 consumers start reading messages. The msgRateOut shown by bin/pulsar-admin topics stats SpaceEvents also confirms that all 3 are reading messages.

At some point, when there are around 181k messages left in the backlog, the consumers get stuck.

The Golang µservices’ logs report warnings only, which come from the Pulsar library itself (it uses logrus), not from our code:

{ remote_addr: { ForceQuery: false Host: "pulsar-proxy.pulsar:6650" RawPath: "" User: null Fragment: "" Opaque: "" RawQuery: "" Path: "" Scheme: "pulsar" RawFragment: "" } level: "warn" local_addr: { Port: 44814 IP: "10.56.0.18" Zone: "" } }
{ remote_addr: { RawPath: "" Scheme: "pulsar" Path: "" Opaque: "" RawQuery: "" Host: "pulsar-proxy.pulsar:6650" RawFragment: "" Fragment: "" User: null ForceQuery: false } level: "warn" error: "write tcp 10.56.0.18:44814->10.0.1.68:6650: use of closed network connection" local_addr: { Port: 44814 Zone: "" IP: "10.56.0.18" } }
{ remote_addr: { RawQuery: "" Opaque: "" ForceQuery: false RawFragment: "" Scheme: "pulsar" Fragment: "" RawPath: "" Path: "" Host: "pulsar-proxy.pulsar:6650" User: null } local_addr: { Zone: "" Port: 42612 IP: "10.56.0.21" } level: "warn" }
{ level: "warn" remote_addr: { User: null Path: "" Scheme: "pulsar" RawQuery: "" Fragment: "" Opaque: "" Host: "pulsar-proxy.pulsar:6650" ForceQuery: false RawPath: "" RawFragment: "" } local_addr: { Zone: "" IP: "10.56.0.21" Port: 42612 } error: "write tcp 10.56.0.21:42612->10.0.1.68:6650: use of closed network connection" }
{ local_addr: { Zone: "" Port: 51340 IP: "10.56.0.17" } remote_addr: { User: null ForceQuery: false RawPath: "" RawFragment: "" RawQuery: "" Path: "" Host: "pulsar-proxy.pulsar:6650" Opaque: "" Fragment: "" Scheme: "pulsar" } level: "warn" }

Pulsar logs report the following (grepping for exceptions).

Broker

java.lang.NullPointerException: null

org.apache.bookkeeper.mledger.ManagedLedgerException: Unknown exception

08:25:35.597 [bookkeeper-ml-workers-OrderedExecutor-3-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://public/default/SpaceEvents / cloud-pulsar-tester] Error reading entries at 1430:0 : Unknown exception, Read Type Replay - Retrying to read in 15.0 seconds

java.lang.NullPointerException: null
	at org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1776) ~[org.apache.pulsar-pulsar-common-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:87) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

java.lang.NullPointerException: null
	at org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:125) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

java.lang.NullPointerException: null
	at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

08:27:09.703 [bookkeeper-ml-workers-OrderedExecutor-3-0] WARN  org.apache.bookkeeper.mledger.impl.OpReadEntry - [public/default/persistent/SpaceEvents][cloud-pulsar-tester] read failed from ledger at position:2193:0 : Unknown exception

08:27:09.703 [bookkeeper-ml-workers-OrderedExecutor-3-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers - [persistent://public/default/SpaceEvents / cloud-pulsar-tester] Error reading entries at 2193:0 : Unknown exception, Read Type Normal - Retrying to read in 15.0 seconds

java.lang.NullPointerException: null
	at org.apache.pulsar.common.protocol.Commands.peekMessageMetadata(Commands.java:1776) ~[org.apache.pulsar-pulsar-common-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:87) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

java.lang.NullPointerException: null
	at org.apache.pulsar.broker.service.AbstractBaseDispatcher.filterEntriesForConsumer(AbstractBaseDispatcher.java:125) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.persistent.PersistentStickyKeyDispatcherMultipleConsumers.sendMessagesToConsumers(PersistentStickyKeyDispatcherMultipleConsumers.java:192) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.pulsar.broker.service.persistent.PersistentDispatcherMultipleConsumers.readEntriesComplete(PersistentDispatcherMultipleConsumers.java:480) ~[org.apache.pulsar-pulsar-broker-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

java.lang.NullPointerException: null
	at org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger-2.6.1.jar:2.6.1]
	at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common-4.10.0.jar:4.10.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.48.Final.jar:4.1.48.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]

Proxy

08:27:14.083 [pulsar-proxy-io-2-3] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.

io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

08:27:14.090 [pulsar-proxy-io-2-3] WARN  org.apache.pulsar.proxy.server.ProxyConnection - [/10.56.0.21:42612] Got exception NativeIoException : readAddress(..) failed: Connection reset by peer null

08:27:38.090 [pulsar-proxy-io-2-8] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.

io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer

08:27:38.090 [pulsar-proxy-io-2-8] WARN  org.apache.pulsar.proxy.server.ProxyConnection - [/10.56.0.17:51340] Got exception NativeIoException : readAddress(..) failed: Connection reset by peer null

When inspecting the subscription via the pulsar-admin CLI tool, we see no consumers at all, despite having 3 services running (and using your Pulsar Golang library - the native one, no cgo):

$ bin/pulsar-admin topics stats SpaceEvents | jq '.subscriptions["cloud-pulsar-tester"]' -c | jq

{
  "msgRateOut": 0,
  "msgThroughputOut": 0,
  "bytesOutCounter": 0,
  "msgOutCounter": 0,
  "msgRateRedeliver": 0,
  "chuckedMessageRate": 0,
  "msgBacklog": 181626,
  "msgBacklogNoDelayed": 181626,
  "blockedSubscriptionOnUnackedMsgs": false,
  "msgDelayed": 0,
  "unackedMessages": 0,
  "type": "Key_Shared",
  "msgRateExpired": 0,
  "lastExpireTimestamp": 0,
  "lastConsumedFlowTimestamp": 1598516857278,
  "lastConsumedTimestamp": 0,
  "lastAckedTimestamp": 0,
  "consumers": [],
  "isDurable": true,
  "isReplicated": false
}

If we inspect the Golang µservices logs at a more verbose level, we see entries coming from the underlying Pulsar Golang library, as shown below (note that these entries keep being printed regularly even after the consumers get stuck):

{"level":"debug","severity":"DEBUG","time":"2020-08-27T09:14:12Z","topic":"SpaceEvents"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}
{"level":"debug","local_addr":{"IP":"10.56.0.17","Port":51336,"Zone":""},"remote_addr":{"Scheme":"pulsar","Opaque":"","User":null,"Host":"pulsar-proxy.pulsar:6650","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"severity":"DEBUG","time":"2020-08-27T09:14:22Z"}