Add failOnDataLoss parameter - configurable strictness when events expire
A bit of background for this issue. In Event Hubs, there’s a retention policy of 1 to 7 days. Once an event has been in your Event Hubs for your retention policy length, it will be removed by the service – I say that the event expires.
In Spark, there are two key events worth mentioning:
- The driver schedules a consumption range for a specific partition. Let’s say from `x` to `y` for partition `1`. This scheduling requires asking the Event Hubs service, “What events are available in the Event Hubs?”
- The executor tries to consume that range of events - in our case from `x` to `y`.
Now, there is a case where an event is present during scheduling but expires before consumption. When this happens, the connector detects it, freaks out, and crashes the job (very much on purpose, though).
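For illustration, here is a minimal Scala sketch of that detection - not the connector’s actual code; `EventHubsClient` and `earliestSeqNo` are hypothetical stand-ins for calls to the Event Hubs service:

```scala
// Minimal sketch, not the connector's actual code: detect that events scheduled
// by the driver expired before the executor could read them. EventHubsClient and
// earliestSeqNo are hypothetical stand-ins for calls to the Event Hubs service.
trait EventHubsClient {
  def earliestSeqNo(partitionId: Int): Long
}

def checkScheduledRange(client: EventHubsClient,
                        scheduledStart: Long,
                        partitionId: Int): Unit = {
  val earliestAvailable = client.earliestSeqNo(partitionId)
  if (scheduledStart < earliestAvailable) {
    // Events in [scheduledStart, earliestAvailable) expired between scheduling
    // and consumption; today this always fails the job.
    throw new IllegalStateException(
      s"Partition $partitionId lost events $scheduledStart to ${earliestAvailable - 1}: " +
        "they expired before they could be consumed.")
  }
}
```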
I re-wrote this connector entirely and it’s only been live since mid-March - it’s quite young! It’s becoming clear to me that this level of strictness is extreme, so I’d like that strictness to be variable. Users should have two options when we detect this occurrence:
- Crash the Spark job - I want a high level of strictness!
- Print this as a warning in the logs - I want to know this happened, but I don’t want my Spark job to go down!
These two options will be available through a boolean setting, failOnDataLoss. The default value will be true.
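As a rough sketch of how the setting could gate the behavior (illustrative names only, not the connector’s real internals):

```scala
import org.slf4j.LoggerFactory

// Illustrative sketch only: route a detected data-loss event either to a hard
// failure (strict, the default) or to a warning in the logs (lenient).
object DataLossHandler {
  private val log = LoggerFactory.getLogger(getClass)

  def onDataLoss(failOnDataLoss: Boolean, message: String): Unit = {
    if (failOnDataLoss) {
      throw new IllegalStateException(message) // crash the Spark job
    } else {
      log.warn(message) // record it, but let the job keep running
    }
  }
}
```

If the setting ends up exposed as a Structured Streaming reader option (an assumption about the final API), a user wanting the lenient behavior would set something like `.option("failOnDataLoss", "false")` when creating the stream.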
Let me know if there’re any questions/comments/etc 👍
Top GitHub Comments
Second vote for the warning 😃 Even better behind a flag, but for now the warning-only behavior is exactly what I did in the version I built from my fork.
Yes, please implement that! I was just thinking about opening a feature request with a similar idea, but this one is better.