Events not received on consumer during `Too much pending tasks` error in Azure Event Hubs SDK

Package Name: @azure/event-hubs
Package Version: 5.4.0
Operating system:
nodejs
- version: v14.15.4
browser
- name/version:
typescript
- version: 4.1.5
Is the bug related to documentation in
- README.md
- source code documentation
- SDK API docs on https://docs.microsoft.com

Describe the bug: Some events were not received by the consumer. This is possibly related to ticket 14606.

We received some errors in the processError method on 13th May. The log message said MessagingError: The request with message_id \"5db3cdec-6b8b-46f0-b333-f7cf3f00c7aa\" timed out. Please try again later. We also print the partition ID in the log, but it was not present in the error context passed by the Azure SDK. The detailed log is below:

{"level":"error", "time":"2021-05-12T23:33:10.991Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : MessagingError: The request with message_id \"5db3cdec-6b8b-46f0-b333-f7cf3f00c7aa\" timed out. Please try again later.}"}

We received another log message, Error: Too much pending tasks, in the processError method on 20th May. The detailed log is below:

{"level":"error", "time":"2021-05-25T22:50:53.372Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : Error: Too much pending tasks}"}

Points to note:

From 13th to 19th May we constantly received timeout error messages, after which no events were received.
When we started receiving the Too much pending tasks error on 20th May, still no events were received. Only when we finally restarted our machines did events start coming in again, most probably including the ones that were queued in the event hub.
Once everything had settled after the restart, we assumed that all the missed events had been processed. After further analysis, however, we found that around 50 events were missing between 20th May and 26th May. Our analysis of where the issue might lie follows.

The error arrived in the method below in our code:

processError: (error, context): Promise<void> => {
	AppLogger.error(
		`Error in listening on eventHub: ${context.eventHubName} and partitionId ${context.partitionId}: ${error}}`,
		EventHubService.name
	);
	return Promise.resolve();
},

We log the error and carry on processing other messages in the method below:

processEvents: async (events, context): Promise<void> => {
	AppLogger.info(
		`Received events array length is: ${events?.length}`,
		EventHubService.name
	);
	await Promise.all(events.map(async (event) => {
		await this.handleEvent(event);
		if (config.has('env.MODE') && config.get('env.MODE') === config.get('env.PROD_MODE')) {
			// update checkpoint as per checkpoint frequency
			await this.updateEventCheckpoint(context, events[events.length - 1]);
		}
	}));
},
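A side note on the snippet above: it requests a checkpoint for the last event of the batch from inside every map callback, so (depending on what updateEventCheckpoint does) a checkpoint may be written as soon as the first handler finishes, while other events are still in flight. A minimal sketch of a more conservative pattern is shown below: handle the events in order and checkpoint once per batch. The EventLike and CheckpointContextLike types and the processBatchInOrder name are assumptions made for illustration; they only mirror the shape of the SDK's received event and partition context and are not part of @azure/event-hubs.

```typescript
// Minimal structural types (assumptions) standing in for the SDK's
// received event and partition context.
interface EventLike {
  sequenceNumber: number;
  body: unknown;
}

interface CheckpointContextLike {
  partitionId: string;
  updateCheckpoint(event: EventLike): Promise<void>;
}

// Handle events one by one, then checkpoint once for the whole batch.
// If a handler throws, the checkpoint is skipped, so the batch can be
// read again after a restart instead of being silently skipped.
async function processBatchInOrder(
  events: EventLike[],
  context: CheckpointContextLike,
  handleEvent: (event: EventLike) => Promise<void>
): Promise<void> {
  if (events.length === 0) {
    return; // nothing to handle and nothing to checkpoint
  }
  for (const event of events) {
    await handleEvent(event);
  }
  await context.updateCheckpoint(events[events.length - 1]);
}
```

The trade-off is throughput: events in a batch are no longer handled concurrently, in exchange for the checkpoint never moving ahead of the work actually completed.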

So when the timeout error occurred in the Azure SDK, we did not receive any messages and our service was not processing anything.

EDIT: I gathered some more insights. We have 2 consumer groups mapped to 2 different consumer services listening to a single event hub. One of the services never received the event (which is the issue I raised), while the other service did receive it, as shown below:

{"level":"info", "time":"2021-05-25T08:47:45.412Z","context":"DevicesService","msg":"Received event: device created, device id: w3phoenix-61808291846052781-prod"}

I checked the logs for the same time (2021-05-25T08:47:45.412Z, when the other service received the event) in my service, which had stopped receiving events. The observations were:

processEvents was called at frequent intervals between 14:16:00 and 14:18:00 IST (UTC+5:30, i.e. around the timestamp above), and the array length was 0, i.e. the array had no events in it.
Between 14:16:00 and 14:18:00 there were also Too much pending tasks errors. The log is below: {"level":"error", "time":"2021-05-25T08:46:11.966Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : Error: Too much pending tasks}"}

On 26th May, when we learned that events were not being received, we restarted the service, expecting that the events would still be in the Event Hub pipeline and that we would receive them. But the event in question above was never received.
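Because the stall described here went unnoticed for days, one possible mitigation is a small watchdog fed from processEvents and processError. The sketch below is only an illustration: ConsumerStallDetector, its thresholds, and its method names are hypothetical and not part of the Azure SDK. It flags a consumer as stalled when errors have accumulated and no non-empty batch has arrived for a configured period.

```typescript
// Hypothetical helper: tracks when events were last received and how many
// errors have occurred since, so a stalled consumer can be flagged early.
class ConsumerStallDetector {
  private readonly maxSilenceMs: number;
  private readonly maxErrors: number;
  private lastEventAtMs: number;
  private errorsSinceLastEvent = 0;

  constructor(maxSilenceMs: number, maxErrors: number, startedAtMs: number) {
    this.maxSilenceMs = maxSilenceMs;
    this.maxErrors = maxErrors;
    this.lastEventAtMs = startedAtMs;
  }

  // Call from processEvents with the batch size and the current time.
  // Empty batches do not count as activity.
  recordBatch(eventCount: number, nowMs: number): void {
    if (eventCount > 0) {
      this.lastEventAtMs = nowMs;
      this.errorsSinceLastEvent = 0;
    }
  }

  // Call from processError.
  recordError(): void {
    this.errorsSinceLastEvent += 1;
  }

  // True when enough errors have piled up AND no events arrived for too long.
  isStalled(nowMs: number): boolean {
    const silentForMs = nowMs - this.lastEventAtMs;
    return (
      this.errorsSinceLastEvent >= this.maxErrors &&
      silentForMs >= this.maxSilenceMs
    );
  }
}
```

A periodic timer could call isStalled(Date.now()) and raise an alert or trigger a restart; what to do on a stall is left to the service.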

Now the questions are:

1) Will the events that were not received while the SDK was in an error state (i.e. timed out) still be present in the Event Hub queue? If so, we never received them when we restarted our service. Are those events still in the Event Hub queue? If so, how can we confirm that? Can you suggest a metric or log that would help?

The screenshot below, which I took on 1st May, shows Captured Backlog as 0.

As per my understanding, it should have shown some events in the backlog that were enqueued up to 7 days before 1st May. It also shows User Errors as 6; I'm still not sure exactly what User Errors means. The Incoming vs Outgoing messages figure makes some sense, since my service restarted on the evening of the 26th and there is one more service listening to the same hub, so outgoing is ahead of incoming.

[Screenshot: Event Hubs metrics, InkedMicrosoftTeams-image (5)]

2) If point 1) holds true, how can we replay those events in our service? We never learned the offsets of those events. Apart from one event out of the 50, for which we received the error, none of the other 49 events were received by our service.
3) We have never faced this kind of missing-events issue before. The only correlation we see is that the events went missing at the time the Azure SDK started giving the Too much pending tasks error, i.e. from 20th May. Is it really related?
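One way to approach the first question is to compare the sequence number stored in the last checkpoint with the sequence-number range that the partition still holds. In the SDK, EventHubConsumerClient.getPartitionProperties(partitionId) reports beginningSequenceNumber and lastEnqueuedSequenceNumber, and a checkpoint store's checkpoints carry a sequenceNumber. The sketch below is a rough estimate under stated assumptions: PartitionRangeLike, BacklogEstimate, and estimateBacklog are illustrative names, sequence numbers are assumed to be contiguous, and the partition is assumed to be non-empty.

```typescript
// Structural subset (assumption) of the partition properties returned by the SDK.
interface PartitionRangeLike {
  beginningSequenceNumber: number;
  lastEnqueuedSequenceNumber: number;
}

interface BacklogEstimate {
  expired: number; // events after the checkpoint that are no longer retained
  pending: number; // events after the checkpoint that can still be read
}

// Estimate how many events after the checkpoint are still readable and how
// many have already aged out of the partition.
function estimateBacklog(
  range: PartitionRangeLike,
  checkpointSequenceNumber: number
): BacklogEstimate {
  const nextWanted = checkpointSequenceNumber + 1;
  const expired = Math.max(0, range.beginningSequenceNumber - nextWanted);
  const firstReadable = Math.max(nextWanted, range.beginningSequenceNumber);
  const pending = Math.max(
    0,
    range.lastEnqueuedSequenceNumber - firstReadable + 1
  );
  return { expired, pending };
}
```

A non-zero expired count for a partition would suggest that events were dropped by retention before the consumer could read them, rather than still waiting in the hub.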

To Reproduce: Steps to reproduce the behavior:

2 containers running in AWS, listening on the event hub using EventHubConsumerClient. There are 32 partitions in the event hub.

Expected behavior: Events should never be missed by the consumer.

Issue Analytics

State:
Created 2 years ago
Comments:7 (3 by maintainers)

Top GitHub Comments

chradek commented, Jun 14, 2021

@amit12cool Are you setting a startPosition when you start your subscribe call? By default, if no checkpoint exists the consumer will start reading new events that come in after subscribe is called. You can configure it to start from the beginning of the event stream instead like this:

const { EventHubConsumerClient, earliestEventPosition } = require('@azure/event-hubs');
const client = new EventHubConsumerClient(
    EventHubConsumerClient.defaultConsumerGroupName,
    connectionString,
    eventHubName,
    checkpointStore
);

client.subscribe({
    processEvents(events, {partitionId}) {
        if (events.length) {
            console.log(events[0].sequenceNumber)
            console.log(`Received ${events.length} events on partition "${partitionId}"`);
        }
    },
    processError(err, {partitionId}) {
        console.error(`Error received on partition "${partitionId}"`);
        console.error(err);
    }
}, {startPosition: earliestEventPosition}) // If no checkpoint exists, the start of the partition will be used.
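As a variation on the example above, a replay can also start from a point in time instead of the very beginning: the SDK's event position accepts an enqueuedOn date, and startPosition can be either a single position or a map keyed by partition ID. The helper below is a sketch with hypothetical names (EventPositionLike, buildReplayStartPositions) that builds such a map. As noted above, a start position only takes effect for partitions without an existing checkpoint, so a replay would typically use a consumer group or checkpoint store that has none.

```typescript
// Structural subset (assumption) of the SDK's event position.
interface EventPositionLike {
  enqueuedOn: Date;
}

// Build a per-partition start position map so that every partition starts
// reading from events enqueued at or after `since`.
function buildReplayStartPositions(
  partitionIds: string[],
  since: Date
): { [partitionId: string]: EventPositionLike } {
  const positions: { [partitionId: string]: EventPositionLike } = {};
  for (const partitionId of partitionIds) {
    // Copy the date so callers cannot mutate a shared instance.
    positions[partitionId] = { enqueuedOn: new Date(since.getTime()) };
  }
  return positions;
}
```

The resulting object could be passed as the startPosition option of subscribe, with the partition IDs taken from client.getPartitionIds(). Events older than the hub's retention period cannot be replayed this way.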

amit12cool commented, Jun 21, 2021

@chradek Thanks for the solution, that helps. We were also able to do the RCA of this issue on our side: message retention was set to 1 day, which is the default when an event hub is created. We need it to be set to the maximum of 7 days.
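The retention finding can be checked with simple arithmetic. The sketch below treats retention as a hard cutoff, which is a simplification of how the service actually expires events. The enqueue time is taken from the log line quoted in the issue; the restart time is an assumption, since the issue only says the service was restarted on the evening of 26th May. isLikelyRetained is an illustrative name.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true if an event enqueued at `enqueuedOn` would still be within
// the retention window when read at `readAt` (hard-cutoff simplification).
function isLikelyRetained(
  enqueuedOn: Date,
  readAt: Date,
  retentionDays: number
): boolean {
  const ageMs = readAt.getTime() - enqueuedOn.getTime();
  return ageMs <= retentionDays * MS_PER_DAY;
}

// Timestamp from the issue's log line for the event the other service received.
const eventEnqueuedOn = new Date("2021-05-25T08:47:45.412Z");
// Assumed restart time; the exact time is not given in the issue.
const assumedRestartAt = new Date("2021-05-26T18:00:00.000Z");

// With 1-day retention the event is about 33 hours old at restart, so it is gone.
const retainedWithOneDay = isLikelyRetained(eventEnqueuedOn, assumedRestartAt, 1);
// With 7-day retention the same event would still be readable.
const retainedWithSevenDays = isLikelyRetained(eventEnqueuedOn, assumedRestartAt, 7);
```

Under these assumptions, retainedWithOneDay is false and retainedWithSevenDays is true, which is consistent with the event never arriving after the restart.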
