Add failOnDataLoss parameter - configurable strictness when events expire
A bit of background for this issue. In Event Hubs, there’s a retention policy of 1 to 7 days. Once an event has been in your Event Hubs for your retention policy length, it will be removed by the service – I say that the event expires.
In Spark, there are two key events worth mentioning:
- The driver schedules a consumption range for a specific partition. Let’s say from `x` to `y` for partition `1`. This scheduling requires asking the Event Hubs service, “What events are available in the Event Hubs?”
- The executor tries to consume that range of events - in our case from `x` to `y`.
Now, there is a case where an event is present during scheduling but expires before consumption. When this happens, the connector detects it, freaks out, and crashes the job (very much on purpose, though).
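For illustration, here is a minimal Scala sketch of that detection - not the connector’s actual code; `EventHubsClient` and `earliestSeqNo` are hypothetical stand-ins for calls to the Event Hubs service:

```scala
// Minimal sketch, not the connector's actual code: detect that events scheduled
// by the driver expired before the executor could read them. EventHubsClient and
// earliestSeqNo are hypothetical stand-ins for calls to the Event Hubs service.
trait EventHubsClient {
  def earliestSeqNo(partitionId: Int): Long
}

def checkScheduledRange(client: EventHubsClient,
                        scheduledStart: Long,
                        partitionId: Int): Unit = {
  val earliestAvailable = client.earliestSeqNo(partitionId)
  if (scheduledStart < earliestAvailable) {
    // Events in [scheduledStart, earliestAvailable) expired between scheduling
    // and consumption; today this always fails the job.
    throw new IllegalStateException(
      s"Partition $partitionId lost events $scheduledStart to ${earliestAvailable - 1}: " +
        "they expired before they could be consumed.")
  }
}
```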
I re-wrote this connector entirely and it’s only been live since mid-March - it’s quite young! It’s becoming clear to me that this level of strictness is extreme, so I’d like that strictness to be variable. Users should have two options when we detect this occurrence:
- Crash the Spark job - I want a high level of strictness!
- Print this as a warning in the logs - I want to know this happened, but I don’t want my Spark job to go down!
These two options will be available through a boolean setting, failOnDataLoss. The default value will be true.
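As a rough sketch of how the setting could gate the behavior (illustrative names only, not the connector’s real internals):

```scala
import org.slf4j.LoggerFactory

// Illustrative sketch only: route a detected data-loss event either to a hard
// failure (strict, the default) or to a warning in the logs (lenient).
object DataLossHandler {
  private val log = LoggerFactory.getLogger(getClass)

  def onDataLoss(failOnDataLoss: Boolean, message: String): Unit = {
    if (failOnDataLoss) {
      throw new IllegalStateException(message) // crash the Spark job
    } else {
      log.warn(message) // record it, but let the job keep running
    }
  }
}
```

If the setting ends up exposed as a Structured Streaming reader option (an assumption about the final API), a user wanting the lenient behavior would set something like `.option("failOnDataLoss", "false")` when creating the stream.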
Let me know if there’re any questions/comments/etc 👍
Top GitHub Comments
Second vote for the warning 😃 Even better behind a flag, but for now the warning-only behavior is exactly what I did in the version I built from my fork.
Yes, please implement that! I was just thinking about opening a feature request with a similar idea, but this one is better.