[QUERY] Zero downtime Event Hub processor migration from v4 to v5

Library name and version

Azure.Messaging.EventHubs 5.6.2

Query/Question

Is it possible to achieve zero-downtime migration of Event Hub consumers using v4 (Microsoft.Azure.EventHubs) to v5 (Azure.Messaging.EventHubs)? If not, do you have any tips on minimizing the downtime?

The breaking change in the checkpoint format is a major obstacle to migration. Currently we have the following plan:

1. Use a custom EventProcessor<TPartition> that is able to read legacy checkpoints (in OnInitializingPartitionAsync).
2. The old consumers are running on side A.
3. Deploy the new code to side B. Events are not consumed there, since the v5 SDK uses epoch value 0 and the new consumers restart endlessly after hitting the "epoch exception" [1].
4. Disconnect the old consumers by stopping side A.
5. The new consumers start consuming events on side B.
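
Step 1 of the plan amounts to translating a legacy checkpoint into a starting position for the new processor. The sketch below is a language-neutral illustration in Python rather than SDK code: the real logic would live in the custom processor's OnInitializingPartitionAsync. The JSON field names (`PartitionId`, `Offset`, `SequenceNumber`) are assumptions about how the v4 lease blob is laid out and should be checked against an actual blob; `StartingPosition` and `read_legacy_checkpoint` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartingPosition:
    """Where a new processor should resume reading a partition (hypothetical type)."""
    partition_id: str
    offset: Optional[str]
    sequence_number: Optional[int]


def read_legacy_checkpoint(blob_text: str) -> Optional[StartingPosition]:
    """Parse the text of a legacy lease/checkpoint blob.

    Returns None when the blob is empty or carries no position, in which
    case the caller would fall back to its default starting position.
    Field names are ASSUMED, not taken from the SDK documentation.
    """
    if not blob_text.strip():
        return None

    doc = json.loads(blob_text)
    offset = doc.get("Offset")
    sequence_number = doc.get("SequenceNumber")

    # A lease that was never checkpointed has neither value set.
    if offset in (None, "") and sequence_number is None:
        return None

    return StartingPosition(
        partition_id=str(doc.get("PartitionId", "")),
        offset=offset or None,
        sequence_number=sequence_number,
    )
```

Returning None for a blob without a position keeps the "no legacy checkpoint" case explicit, so the new processor does not silently restart from the beginning of the stream.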

There is a possibility of downtime between steps 4 and 5, depending on how fast the old consumers can be stopped, how fast the Azure Event Hubs service detects the disconnects, and how fast the new consumers detect them. Are there any settings in the SDK or the Azure portal that would help minimize that window?

The best option would be to pass a higher epoch in the v5 SDK (forcing the old consumers to disconnect), but that is not possible: the value 0 is hardcoded in the EventProcessor class.

[1]

Exception message: "Receiver 'e69d42ae-72a6-418e-be1f-4d388a390188' with a higher epoch '2' already exists. Receiver 'P1-b47177da-cf2b-46f0-8cd1-dfa007165fd9' with epoch 0 cannot be created. Make sure you are creating receiver with increasing epoch value to ensure connectivity, or ensure all old epoch receivers are closed or disconnected."
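
The epoch behavior described here can be modeled with a toy example. This is a simplified sketch of the rule as it appears in this issue (a lower epoch is rejected, a higher epoch displaces the current receiver), not the service's actual implementation. `Partition` and `EpochRejected` are hypothetical names, and treating an equal epoch like a higher one is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


class EpochRejected(Exception):
    """Raised when a receiver asks for a lower epoch than the active one."""


@dataclass
class Partition:
    """Toy model of exclusive (epoch-based) receivers on one partition."""
    active_receiver: Optional[str] = None
    active_epoch: Optional[int] = None

    def connect(self, receiver: str, epoch: int) -> Optional[str]:
        """Attach a receiver; return the name of any receiver it displaced."""
        if self.active_receiver is not None and epoch < self.active_epoch:
            # Mirrors the exception text: a lower epoch cannot be created
            # while a higher-epoch receiver is still connected.
            raise EpochRejected(
                f"{receiver} (epoch {epoch}) rejected; "
                f"{self.active_receiver} holds epoch {self.active_epoch}"
            )
        # ASSUMPTION: an equal epoch is treated like a higher one here.
        displaced = self.active_receiver
        self.active_receiver, self.active_epoch = receiver, epoch
        return displaced

    def disconnect(self, receiver: str) -> None:
        """Release the partition if this receiver currently holds it."""
        if self.active_receiver == receiver:
            self.active_receiver = None
            self.active_epoch = None
```

In this model a side B consumer with epoch 0 keeps failing while side A holds epoch 2, and succeeds only once side A has disconnected, which is the gap between steps 4 and 5.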

Environment

No response

Issue Analytics

Created 2 years ago
Comments: 7 (4 by maintainers)

Top GitHub Comments

rzepinskip commented, Mar 18, 2022

OK, I think it is all clear now:

We should be fine with a downtime of LoadBalancingInterval * 2 (deployment-related delays easily outweigh that), and the increase in resource usage is temporary (only during the migration), so that is not a problem either.
Exposing the epoch is a public interface change, so it requires a long timeline, and there may be many intricacies around it that we would have to consider.
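
As a worked example of the downtime estimate above, the arithmetic is just the load balancing interval multiplied by two. The 30-second interval used in the usage note is an illustrative value only, not a claim about the SDK default, and `estimated_handover_gap` is a hypothetical helper.

```python
from datetime import timedelta


def estimated_handover_gap(load_balancing_interval: timedelta,
                           cycles: int = 2) -> timedelta:
    """Rough bound on the handover gap: `cycles` load balancing intervals.

    The factor of 2 comes from the estimate quoted in the thread; it is
    not derived here.
    """
    if cycles < 0 or load_balancing_interval < timedelta(0):
        raise ValueError("interval and cycles must be non-negative")
    return load_balancing_interval * cycles
```

For an assumed interval of 30 seconds, `estimated_handover_gap(timedelta(seconds=30))` gives one minute, which is small next to a typical deployment window.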

Thanks for your help. Feel free to close the issue.

msftbot[bot] commented, Mar 18, 2022

Hi @rzepinskip. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.
