Events not received on consumer during `Too much pending tasks` error in Azure Event Hubs SDK
- Package Name: @azure/event-hubs
- Package Version: 5.4.0
- Operating system:
- nodejs
- version: v14.15.4
- browser
- name/version:
- typescript
- version: 4.1.5
- Is the bug related to documentation in
- README.md
- source code documentation
- SDK API docs on https://docs.microsoft.com
Describe the bug
Some events were not received by the consumer. This is possibly related to ticket 14606.
We received some errors in the `processError` method on 13th May. The log message said: `MessagingError: The request with message_id "5db3cdec-6b8b-46f0-b333-f7cf3f00c7aa" timed out. Please try again later.`
We also print the partition ID in the log, but it was not present in the error message thrown by the Azure SDK.
The detailed log is below:
{"level":"error", "time":"2021-05-12T23:33:10.991Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : MessagingError: The request with message_id \"5db3cdec-6b8b-46f0-b333-f7cf3f00c7aa\" timed out. Please try again later.}"}
We received another log message, `Error: Too much pending tasks`, in the `processError` method on 20th May.
The detailed log is below:
{"level":"error", "time":"2021-05-25T22:50:53.372Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : Error: Too much pending tasks}"}
Points to note:
- From 13th to 19th May we constantly received `timed out` error messages, after which no events were received.
- Even after we first received `Too much pending tasks` on 20th May, no events were received. Only when we finally restarted our machines did events start coming in again, most probably including the ones that were waiting in the Event Hub queue.
- Once everything had settled after the restart, we assumed all the missed events had been processed by then. But after further analysis we found that around 50 events were missing from 20th May to 26th May. As per our analysis, we suspect the cause of the missing events might be:
- The error surfaced in the following method in our code:

      processError: (error, context): Promise<void> => {
        AppLogger.error(
          `Error in listening on eventHub: ${context.eventHubName} and partitionId ${context.partitionId}: ${error}}`,
          EventHubService.name
        );
        return Promise.resolve();
      },
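The logs above show an empty partition ID, which would be consistent with the error not being tied to a single partition. As a minimal sketch (not the SDK's API), the log line could make that case explicit. `ErrorContextLike` and `describeListenerError` are hypothetical names introduced here, and the assumption that `partitionId` arrives as an empty string in this situation is inferred only from the logs in this report.

```typescript
// Hypothetical stand-in for the two context fields the handler logs.
interface ErrorContextLike {
  eventHubName: string;
  // Assumption: empty string when the error is not tied to one partition.
  partitionId: string;
}

// Builds the log message as a pure function so it can be unit-tested.
function describeListenerError(error: Error, context: ErrorContextLike): string {
  const partition = context.partitionId !== "" ? context.partitionId : "<none>";
  return (
    `Error in listening on eventHub: ${context.eventHubName}, ` +
    `partitionId: ${partition}: ${error.name}: ${error.message}`
  );
}
```

Inside `processError`, the result of `describeListenerError(error, context)` could be passed to the existing `AppLogger.error` call.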
We catch the error and go ahead with processing other messages in the method below:

    processEvents: async (events, context): Promise<void> => {
      AppLogger.info(
        `Received events array length is: ${events?.length}`,
        EventHubService.name
      );
      await Promise.all(events.map(async (event) => {
        await this.handleEvent(event);
        if (config.has('env.MODE') && config.get('env.MODE') === config.get('env.PROD_MODE')) {
          // update checkpoint as per checkpoint frequency
          await this.updateEventCheckpoint(context, events[events.length - 1]);
        }
      }));
    },
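One thing worth noting about the handler above: it checkpoints the last event of the batch once per event, possibly before the other events in the batch have been handled. The following is a minimal sketch of an alternative that checkpoints once, only after every event in the batch has been handled. It is not offered as the cause of the missing events. `EventLike`, `CheckpointContextLike`, and `processBatch` are hypothetical names; the only SDK behavior assumed is that the partition context exposes an `updateCheckpoint(event)` method.

```typescript
// Hypothetical minimal shapes; the real SDK types carry more fields.
interface EventLike {
  sequenceNumber: number;
  body: unknown;
}

interface CheckpointContextLike {
  updateCheckpoint(event: EventLike): Promise<void>;
}

// Handles a batch, then checkpoints once at the last event.
// Returns the number of events handled.
async function processBatch(
  events: EventLike[],
  context: CheckpointContextLike,
  handleEvent: (event: EventLike) => Promise<void>,
): Promise<number> {
  // Empty batches have nothing to handle or checkpoint.
  if (events.length === 0) {
    return 0;
  }
  // If any handler rejects, this throws and no checkpoint is written.
  await Promise.all(events.map((event) => handleEvent(event)));
  // Single checkpoint for the whole batch.
  await context.updateCheckpoint(events[events.length - 1]);
  return events.length;
}
```

With this shape, a failure in `handleEvent` leaves the checkpoint where it was, so the batch can be read again after a restart instead of being skipped. The production-mode check from the original handler would wrap the `updateCheckpoint` call.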
So when the `timed out` error occurred in the Azure SDK, we did not receive any messages and our service was not processing anything.
EDIT: I gathered some more insights. We have 2 consumer groups mapped to 2 different consumer services listening to a single event hub. One of the services never received the event, which is the issue I raised, while the other service received the event, as shown below:
{"level":"info", "time":"2021-05-25T08:47:45.412Z","context":"DevicesService","msg":"Received event: device created, device id: w3phoenix-61808291846052781-prod"}
I checked the logs of my service (the one that stopped receiving events) around the time the other service received the event (2021-05-25T08:47:45.412Z). The observations were:
- `processEvents` was called at frequent intervals between 14:16:00 and 14:18:00 (IST, i.e. +5:30 from the time above), and the array length was 0, i.e. the array had no events in it.
- Between 14:16:00 and 14:18:00 there were `Too much pending tasks` errors. The log is below:
{"level":"error", "time":"2021-05-25T08:46:11.966Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : Error: Too much pending tasks}"}
So on 26th May, when we learned that events were not being received, we restarted the service, expecting that the events would still be in the Event Hub pipeline and that we would receive them. But the event in question above was never received.
Now the questions are:
- Will the events that were not received while the SDK was in the error state (i.e. `timed out`) still be present in the Event Hub queue? If so, we never received them when we restarted our service. Are those events still in the Event Hub queue? If so, how can we confirm that? Can you suggest a metric or log that would help?
  The screenshot below, which I took on 1st May, shows Captured Backlog as 0.
  As per my understanding, it should have shown some events in the backlog that were enqueued up to 7 days before 1st May. It also shows User Errors as 6; I'm still not sure exactly what User Errors means. Incoming vs. outgoing messages makes some sense, as my service restarted on the evening of the 26th and there is one more service listening to the same hub, so outgoing is ahead of incoming.
- If point 1 holds true, how can we replay those events in our service? We never learned the offsets of those events: barring the one event out of 50 for which we received the error, none of the other 49 events were received by our service.
- We have never faced this kind of missing-events issue before. The only correlation we see is that it happened when the Azure SDK started giving the `Too much pending tasks` error, i.e. from 20th May. Are the two really related?
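On the replay question, one option that does not need the offsets of the missing events is to start reading from a point in time slightly before the first error. The sketch below only computes that position. `EventPositionLike` and `replayStartPosition` are hypothetical names, and the 5-minute margin in the usage note is an arbitrary illustration. It assumes the SDK accepts a start position of the form `{ enqueuedOn: Date }`, and that, as far as I understand it, an existing checkpoint for a partition takes precedence over any start position.

```typescript
// Hypothetical stand-in for the time-based form of a start position.
interface EventPositionLike {
  enqueuedOn: Date;
}

// Returns a start position `marginMinutes` before the given ISO timestamp.
function replayStartPosition(firstErrorIso: string, marginMinutes: number): EventPositionLike {
  const firstError = new Date(firstErrorIso);
  if (Number.isNaN(firstError.getTime())) {
    throw new Error(`Invalid timestamp: ${firstErrorIso}`);
  }
  // Step back by the margin so events enqueued just before the error are included.
  return { enqueuedOn: new Date(firstError.getTime() - marginMinutes * 60_000) };
}
```

For example, `replayStartPosition("2021-05-25T08:46:11.966Z", 5)` gives a position at 08:41:11.966Z, which could then be supplied as the `startPosition` when subscribing. Replaying from a time means some events will be delivered twice, so `handleEvent` would need to tolerate duplicates.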
To Reproduce
Steps to reproduce the behavior:
- 2 containers running in AWS, listening on the event hub using the Event Hubs consumer client. There are 32 partitions in the event hub.
Expected behavior
Events should never be missed by the consumer.
Top GitHub Comments
@amit12cool Are you setting a `startPosition` when you start your subscribe call? By default, if no checkpoint exists, the consumer will start reading new events that come in after `subscribe` is called. You can configure it to start from the beginning of the event stream instead by setting `startPosition` in the options passed to `subscribe`.
@chradek Thanks for the solution, that helps. We were able to do the RCA of this issue on our side and found that we had message retention set to 1 day, i.e. the default when an event hub is created. We need it set to the maximum of 7 days.
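The retention finding can be illustrated with a small date calculation. This is a rough sketch that treats retention as a simple age cutoff; the service's actual expiry behavior may differ. `isWithinRetention` is a hypothetical helper, and the read time used in the usage note is an assumed value for "26th May evening", not one taken from the logs.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True if an event enqueued at `enqueuedAt` is no older than
// `retentionDays` at the moment the consumer tries to read it.
function isWithinRetention(enqueuedAt: Date, readAt: Date, retentionDays: number): boolean {
  const ageMs = readAt.getTime() - enqueuedAt.getTime();
  return ageMs <= retentionDays * MS_PER_DAY;
}
```

Taking the event from the other service's log (2021-05-25T08:47:45.412Z) and an assumed restart at 2021-05-26T12:30:00Z, `isWithinRetention` returns `false` for 1 day of retention and `true` for 7 days. That matches the conclusion that the event had already aged out by the time the service was restarted.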