Wrong seqNo is set when reading from Event Hubs
Bug Report:
- Actual behavior: Same issue as https://github.com/Azure/azure-event-hubs-spark/issues/462. The consequence is severe: I am unable to restart the same Spark Structured Streaming query without creating a new checkpoint (a minimal sketch of the setup appears after this report).
The following is the stacktrace:
Job aborted due to stage failure: Task 30 in stage 2348.0 failed 4 times, most recent failure: Lost task 30.3 in stage 2348.0 (TID 4058, 10.139.64.6, executor 0): java.lang.IllegalStateException: In partition 30 of http-access-log, with consumer group $Default, request seqNo 19609525 is less than the received seqNo 19684911. The earliest seqNo is 19684804 and the last seqNo is 20231767
at org.apache.spark.eventhubs.client.CachedEventHubsReceiver.checkCursor(CachedEventHubsReceiver.scala:189)
at org.apache.spark.eventhubs.client.CachedEventHubsReceiver.org$apache$spark$eventhubs$client$CachedEventHubsReceiver$$receive(CachedEventHubsReceiver.scala:213)
at org.apache.spark.eventhubs.client.CachedEventHubsReceiver$.receive(CachedEventHubsReceiver.scala:288)
at org.apache.spark.eventhubs.rdd.EventHubsRDD.compute(EventHubsRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:353)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:317)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
- Expected behavior: The offset is set correctly and the stream can be restarted from the existing checkpoint.
- Spark version: 2.4.5
- spark-eventhubs artifactId and version: com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.14.1
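The exception above is raised by the connector's cursor validation in CachedEventHubsReceiver.checkCursor (see the stack trace): the seqNo requested from the checkpoint (19609525) is behind both the cached receiver's current position (19684911) and the earliest seqNo still retained in the partition (19684804). The sketch below is a simplified paraphrase of the check as the error message describes it, not the library's actual source; the method shape and parameter names are assumptions for illustration only.

// Simplified paraphrase of the seqNo validation the error message describes
// (NOT the connector's actual source; names and structure are assumptions).
def checkCursor(requestSeqNo: Long,   // seqNo asked for, taken from the checkpoint
                receivedSeqNo: Long,  // seqNo of the first event the cached receiver returns
                earliestSeqNo: Long,  // earliest seqNo still retained in the partition
                lastSeqNo: Long): Unit = {
  if (requestSeqNo < receivedSeqNo) {
    // Here 19609525 < 19684911, and the request is even below the earliest
    // retained seqNo (19684804), so the checkpointed position cannot be served.
    throw new IllegalStateException(
      s"request seqNo $requestSeqNo is less than the received seqNo $receivedSeqNo. " +
        s"The earliest seqNo is $earliestSeqNo and the last seqNo is $lastSeqNo")
  }
}

For context, a minimal version of the setup that hits this on restart looks roughly like the following. The connection string, consumer group, and paths are placeholders; the event hub name "http-access-log" is taken from the error message; EventPosition.fromStartOfStream only applies to the first run, since on restart the connector takes its position from the checkpoint.

// Minimal sketch of the read-from-Event-Hubs / checkpointed-write setup
// (placeholder connection string, paths, and sink; not the reporter's actual job).
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("eventhubs-restart-repro").getOrCreate()

val connectionString = ConnectionStringBuilder("{EVENT HUBS CONNECTION STRING}")
  .setEventHubName("http-access-log")
  .build

val ehConf = EventHubsConf(connectionString)
  .setConsumerGroup("$Default")
  .setStartingPosition(EventPosition.fromStartOfStream)

val events = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

// The first run creates the checkpoint; restarting the same query with the same
// checkpointLocation is what triggers the IllegalStateException shown above.
events.writeStream
  .format("parquet")
  .option("path", "/mnt/output/http-access-log")
  .option("checkpointLocation", "/mnt/checkpoints/http-access-log")
  .start()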
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi @k4jiang, this is a known issue and we're working on a fix. We are going to release a new version with the fix within the next several days.
Version 2.3.15 includes the fix.
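For anyone hitting this, upgrading the connector coordinate to the fixed release would look roughly like the sbt line below (assuming Scala 2.11, to match the artifact reported above); Maven or the Databricks library UI takes the same coordinate.

// build.sbt: pin the connector to the release that includes the fix
// (Scala 2.11 assumed, as in the report above)
libraryDependencies += "com.microsoft.azure" % "azure-eventhubs-spark_2.11" % "2.3.15"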