
Geo DR Recovery Breaks Consumers On Restart

See original GitHub issue

Hi,

I was testing the Geo DR recovery story here.

I’ve found that after a failover has completed and I restart the consumer, I constantly receive errors that look like:

Error: The supplied offset '4984' is invalid. The last offset in the system is '96' TrackingId:cae58860-80ef-4c0b-8fd9-86658d4c31d9_B24

I have tried setting InitialOffsetProvider = (_) => EventPosition.FromEnd(), but this seems to have no effect. The processor still tries to go through the old commit offsets (why even have the option in the first place if it's ignored?).

I’m using the sample code provided here
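
For reference, the relevant part of that sample boils down to roughly the following. This is a sketch rather than my exact code: the hub name, connection strings, and container name are placeholders, and SimpleEventProcessor here is a trimmed-down stand-in for the sample's processor class.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.EventHubs;
    using Microsoft.Azure.EventHubs.Processor;

    // Trimmed-down stand-in for the sample's SimpleEventProcessor.
    class SimpleEventProcessor : IEventProcessor
    {
        public Task OpenAsync(PartitionContext context) => Task.CompletedTask;

        public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;

        public Task ProcessErrorAsync(PartitionContext context, Exception error)
        {
            Console.WriteLine($"Error on partition {context.PartitionId}: {error.Message}");
            return Task.CompletedTask;
        }

        public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
        {
            foreach (var eventData in messages)
            {
                Console.WriteLine($"Partition {context.PartitionId}, offset {eventData.SystemProperties.Offset}");
            }
            // Writes the current offset/sequence number into the lease blob container.
            await context.CheckpointAsync();
        }
    }

    class Program
    {
        static async Task Main()
        {
            // Placeholders: the connection string is the Geo-DR *alias* connection string.
            var host = new EventProcessorHost(
                "my-hub",                                   // Event Hub name
                PartitionReceiver.DefaultConsumerGroupName, // "$Default"
                "<alias-connection-string>",
                "<storage-connection-string>",
                "leases");                                  // blob container for leases/checkpoints

            var options = new EventProcessorOptions
            {
                // Only consulted for partitions that have NO checkpoint in the container;
                // partitions with an existing (pre-failover) checkpoint still use that checkpoint.
                InitialOffsetProvider = partitionId => EventPosition.FromEnd()
            };

            await host.RegisterEventProcessorAsync<SimpleEventProcessor>(options);

            Console.WriteLine("Receiving. Press ENTER to stop.");
            Console.ReadLine();
            await host.UnregisterEventProcessorAsync();
        }
    }

As far as I can tell, InitialOffsetProvider is only consulted when a partition has no checkpoint at all, which would explain why it appears to be ignored once the container already holds pre-failover checkpoints.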


Order of events

  1. Create the primary Event Hubs namespace and hub
  2. Create the secondary Event Hubs namespace in a different region, with no hub
  3. Go to Geo-Recovery in primary namespace and link secondary namespace with a new alias
  4. Start the Receiver
  5. Start the Sender
  6. Initiate failover in Azure portal
    • note: the sender/receiver are not stopped while the failover is happening; there are no errors before, during, or after the failover as long as the processes aren’t stopped.
  7. After failover completes, stop sender and receiver
  8. Start receiver, results in error above
  9. Start sender, results in no errors

The only way I have gotten these errors to stop and get everything running as normal again is to delete the Storage Account blob container that contains the commits.
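
The cleanup I am doing by hand is essentially the following (a rough sketch assuming the WindowsAzure.Storage package; the container name is whatever was passed as leaseContainerName to EventProcessorHost):

    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    static class CheckpointReset
    {
        // Drops the lease/checkpoint container so the processor host starts clean on the next run.
        public static async Task ResetAsync(string storageConnectionString, string leaseContainerName)
        {
            CloudStorageAccount account = CloudStorageAccount.Parse(storageConnectionString);
            CloudBlobClient client = account.CreateCloudBlobClient();
            CloudBlobContainer container = client.GetContainerReference(leaseContainerName);

            // Removes all leases and checkpoints for this hub/consumer group;
            // the next receiver run then falls back to InitialOffsetProvider.
            await container.DeleteIfExistsAsync();
        }
    }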


This is an important scenario for my team, and it seems as though setting up geo-recovery will break all of my consumers, or I will have to tell all of my readers to delete their commit blobs whenever a failover happens and their application restarts.

Is there any fix for this? It seems as though if EventProcessorHost used InitialOffsetProvider from EventProcessorOptions instead of ignoring it, this wouldn’t be an issue: the reader would read new data from the partitions without checking for the previous offsets (which don’t exist, because they come from a different hub).
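
As a stopgap, the stale checkpoints can at least be detected before the host is started, by comparing the last checkpointed sequence number against what the hub behind the alias currently reports. This is only a sketch: the connection string must include the EntityPath, and where the checkpointed sequence number is read from is left to the application.

    using System.Threading.Tasks;
    using Microsoft.Azure.EventHubs;

    static class FailoverCheck
    {
        // Heuristic: if the checkpointed sequence number is AHEAD of what the hub reports as its
        // last enqueued sequence number, the namespace behind the alias has changed (a failover
        // happened) and the stored checkpoints are no longer valid for this hub.
        public static async Task<bool> CheckpointLooksStaleAsync(
            string aliasConnectionString, string partitionId, long checkpointedSequenceNumber)
        {
            EventHubClient client = EventHubClient.CreateFromConnectionString(aliasConnectionString);
            try
            {
                EventHubPartitionRuntimeInformation info =
                    await client.GetPartitionRuntimeInformationAsync(partitionId);

                return checkpointedSequenceNumber > info.LastEnqueuedSequenceNumber;
            }
            finally
            {
                await client.CloseAsync();
            }
        }
    }

If this returns true for any partition, the lease container can be reset (as above) before registering the processor, so the receivers start from EventPosition.FromEnd() instead of the dead offsets.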

Versions

  • OS platform and version: Windows 10 1903
  • .NET Version: Core 2.1
  • NuGet package version or commit ID:

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 11 (5 by maintainers)

Top GitHub Comments

1 reaction
serkantkaraca commented, Apr 1, 2019

Please note that this is not only for EPH; the underlying receivers also won’t handle the DR namespace switch and will need to be restarted. I will find out where we can put appropriate information into the public documentation.

Currently there is no ETA, but I can say this will be addressed before the end of this year.
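
In practice, a receiver "restart" on the processor-host side amounts to something like the sketch below. The helper and its parameters are illustrative, not part of the SDK; clearing stale checkpoints in between is what actually prevents the invalid-offset error against the new namespace.

    using System.Threading.Tasks;
    using Microsoft.Azure.EventHubs.Processor;

    static class ReceiverRestart
    {
        // Unregister the old host and bring up a fresh one against the Geo-DR alias.
        public static async Task<EventProcessorHost> RestartAsync<TProcessor>(
            EventProcessorHost oldHost,
            string hubName, string consumerGroup, string aliasConnectionString,
            string storageConnectionString, string leaseContainer,
            EventProcessorOptions options)
            where TProcessor : IEventProcessor, new()
        {
            await oldHost.UnregisterEventProcessorAsync();

            // ... reset stale checkpoints here if a failover was detected
            //     (see the container cleanup sketch earlier in the thread) ...

            var newHost = new EventProcessorHost(
                hubName, consumerGroup, aliasConnectionString, storageConnectionString, leaseContainer);
            await newHost.RegisterEventProcessorAsync<TProcessor>(options);
            return newHost;
        }
    }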

0 reactions
axisc commented, Aug 21, 2019

@keggster101020

Please let us know if running a monitor job to check failover state resolved the issue.

I’m closing this issue for now, but if your issue isn’t resolved please open a new issue and reference this.
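
A minimal version of such a monitor might poll the alias's disaster-recovery configuration through the management plane. The sketch below assumes the Microsoft.Azure.Management.EventHub and Microsoft.Rest.ClientRuntime.Azure.Authentication packages and a service principal with read access on the namespace; exact model and property names may differ slightly between SDK versions.

    using System;
    using System.Threading.Tasks;
    using Microsoft.Azure.Management.EventHub;
    using Microsoft.Azure.Management.EventHub.Models;
    using Microsoft.Rest.Azure.Authentication;

    static class FailoverMonitor
    {
        // Reads the Geo-DR alias configuration for a namespace and returns its current role
        // (e.g. Primary, Secondary, PrimaryNotReplicating). A change in this value is the signal
        // that a failover happened and that receivers should reset checkpoints and restart.
        public static async Task<string> GetAliasRoleAsync(
            string tenantId, string clientId, string clientSecret,
            string subscriptionId, string resourceGroup, string namespaceName, string aliasName)
        {
            var credentials = await ApplicationTokenProvider.LoginSilentAsync(tenantId, clientId, clientSecret);
            using (var client = new EventHubManagementClient(credentials) { SubscriptionId = subscriptionId })
            {
                ArmDisasterRecovery dr =
                    await client.DisasterRecoveryConfigs.GetAsync(resourceGroup, namespaceName, aliasName);

                Console.WriteLine($"Alias {aliasName}: role={dr.Role}, partner={dr.PartnerNamespace}");
                return dr.Role.ToString();
            }
        }
    }

A scheduled job could call this every few minutes and, when the role changes, trigger the checkpoint reset and host restart described above.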

Read more comments on GitHub >

Top Results From Across the Web

  • Azure Event Hubs - Geo-disaster recovery
    The Geo-disaster recovery feature ensures that the entire configuration of a namespace (Event Hubs, consumer groups, and settings) is ...
  • Disaster Recovery (Geo)
    Re-enable migrations now that PostgreSQL is restarted and listening on the private address. Edit /etc/gitlab/gitlab.rb and change the configuration to true ...
  • Disaster recovery for planned failover
    As replication between Geo sites is asynchronous, a planned failover requires a maintenance window in which updates to the primary site are blocked ...
  • Architecting disaster recovery for cloud infrastructure outages
    Step-by-step guide to designing disaster recovery for applications in Google ... Data will be stored in a single region within the geographic location ...
  • 10.11. Troubleshooting Geo-replication
    After restarting geo-replication, it will begin a synchronization of the data using checksums. This may be a long and resource-intensive process ...
