
[SUPPORT] Inconsistent reader and writer schema in HoodieAvroDataBlock cause exception

See original GitHub issue

Related JIRA ticket: https://issues.apache.org/jira/browse/HUDI-5271

Describe the problem you faced

When using Spark to create a Hudi table with these configs:

  • INMEMORY or CONSISTENT_HASHING BUCKET index
  • Decimal data type in schema
  • MOR Hudi table
  • Use Spark catalog

querying this table will trigger an exception.

ERROR org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Got exception when reading log file
org.apache.avro.AvroTypeException: Found hoodie.test_mor_tab.test_mor_tab_record.new_test_col.fixed, expecting union
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:209)
...

To Reproduce

Steps to reproduce the behavior:

class TestInsertTable extends HoodieSparkSqlTestBase {

  test("Test Insert Into MOR table") {
    withTempDir { tmp =>
      val tableName = "test_mor_tab"
      // Create a partitioned table
      spark.sql(
        s"""
           |create table $tableName (
           |  id int,
           |  dt string,
           |  name string,
           |  price double,
           |  ts long,
           |  new_test_col decimal(25, 4) comment 'a column for test decimal type'
           |) using hudi
           |options
           |(
           |    type = 'mor'
           |    ,primaryKey = 'id'
           |    ,hoodie.index.type = 'INMEMORY'
           |)
           | tblproperties (primaryKey = 'id')
           | partitioned by (dt)
           | location '${tmp.getCanonicalPath}'
       """.stripMargin)

      // Note: Do not write the column list; the partition field value must be placed last.
      spark.sql(
        s"""
           | insert into $tableName values
           | (1, 'a1', 10, 1000, 1.0, "2021-01-05"),
           | (2, 'a2', 20, 2000, 2.0, "2021-01-06"),
           | (3, 'a3', 30, 3000, 3.0, "2021-01-07")
              """.stripMargin)

      spark.sql(s"select id, name, price, ts, dt from $tableName").show(false)
    }
  }
}
  1. Run this test case in the test class org.apache.spark.sql.hudi.TestInsertTable in the module hudi-spark-datasource/hudi-spark.

Expected behavior

This test case should run properly without any exception

Environment Description

  • Hudi version : latest master branch, commit 3109d890f13b1b29e5796a9f34ab28fa898ec23c

  • Spark version : tried Spark 2.4 and 3.1; both have the same issue

  • Hive version : N/A

  • Hadoop version : N/A

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : no


Stacktrace

The full error stack of the above test case


19963 [ScalaTest-run-running-TestInsertTable] INFO  org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator [] - Code generated in 20.563541 ms
20015 [ScalaTest-run-running-TestInsertTable] INFO  org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator [] - Code generated in 18.67177 ms
20036 [ScalaTest-run-running-TestInsertTable] INFO  org.apache.spark.SparkContext [] - Starting job: apply at OutcomeOf.scala:85
20036 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler [] - Got job 24 (apply at OutcomeOf.scala:85) with 1 output partitions
20036 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler [] - Final stage: ResultStage 35 (apply at OutcomeOf.scala:85)
20036 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler [] - Parents of final stage: List()
20036 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler [] - Missing parents: List()
20037 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler [] - Submitting ResultStage 35 (MapPartitionsRDD[71] at apply at OutcomeOf.scala:85), which has no missing parents
20051 [dag-scheduler-event-loop] INFO  org.apache.spark.storage.memory.MemoryStore [] - Block broadcast_33 stored as values in memory (estimated size 15.4 KB, free 2002.3 MB)
20060 [dag-scheduler-event-loop] INFO  org.apache.spark.storage.memory.MemoryStore [] - Block broadcast_33_piece0 stored as bytes in memory (estimated size 7.4 KB, free 2002.3 MB)
20060 [dispatcher-event-loop-0] INFO  org.apache.spark.storage.BlockManagerInfo [] - Added broadcast_33_piece0 in memory on 10.2.175.58:53317 (size: 7.4 KB, free: 2004.2 MB)
20061 [dag-scheduler-event-loop] INFO  org.apache.spark.SparkContext [] - Created broadcast 33 from broadcast at DAGScheduler.scala:1161
20061 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler [] - Submitting 1 missing tasks from ResultStage 35 (MapPartitionsRDD[71] at apply at OutcomeOf.scala:85) (first 15 tasks are for partitions Vector(0))
20061 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.TaskSchedulerImpl [] - Adding task set 35.0 with 1 tasks
20064 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.TaskSetManager [] - Starting task 0.0 in stage 35.0 (TID 44, localhost, executor driver, partition 0, PROCESS_LOCAL, 8249 bytes)
20064 [Executor task launch worker for task 44] INFO  org.apache.spark.executor.Executor [] - Running task 0.0 in stage 35.0 (TID 44)
20080 [Executor task launch worker for task 44] INFO  org.apache.hudi.common.table.HoodieTableMetaClient [] - Loading HoodieTableMetaClient from file:/private/var/folders/n7/7v_cwpdn79lc75czxd84bdd8mwzd9l/T/spark-cdd1a67d-c0be-4c46-826a-445e29dfa751
20081 [Executor task launch worker for task 44] INFO  org.apache.hudi.common.table.HoodieTableConfig [] - Loading table properties from file:/private/var/folders/n7/7v_cwpdn79lc75czxd84bdd8mwzd9l/T/spark-cdd1a67d-c0be-4c46-826a-445e29dfa751/.hoodie/hoodie.properties
20081 [Executor task launch worker for task 44] INFO  org.apache.hudi.common.table.HoodieTableMetaClient [] - Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=PARQUET) from file:/private/var/folders/n7/7v_cwpdn79lc75czxd84bdd8mwzd9l/T/spark-cdd1a67d-c0be-4c46-826a-445e29dfa751
20083 [Executor task launch worker for task 44] INFO  org.apache.hudi.common.table.timeline.HoodieActiveTimeline [] - Loaded instants upto : Option{val=[20221123105000131__deltacommit__COMPLETED]}
20083 [Executor task launch worker for task 44] INFO  org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Scanning log file HoodieLogFile{pathStr='file:/private/var/folders/n7/7v_cwpdn79lc75czxd84bdd8mwzd9l/T/spark-cdd1a67d-c0be-4c46-826a-445e29dfa751/dt=2021-01-05/.04aba946-8423-4ddd-9d04-fbbd91ba37a2-0_20221123105000131.log.1_0-17-17', fileLen=-1}
20084 [Executor task launch worker for task 44] INFO  org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Reading a data block from file file:/private/var/folders/n7/7v_cwpdn79lc75czxd84bdd8mwzd9l/T/spark-cdd1a67d-c0be-4c46-826a-445e29dfa751/dt=2021-01-05/.04aba946-8423-4ddd-9d04-fbbd91ba37a2-0_20221123105000131.log.1_0-17-17 at instant 20221123105000131
20084 [Executor task launch worker for task 44] INFO  org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Merging the final data blocks
20084 [Executor task launch worker for task 44] INFO  org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Number of remaining logblocks to merge 1
20086 [Executor task launch worker for task 44] ERROR org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader [] - Got exception when reading log file
org.apache.avro.AvroTypeException: Found hoodie.test_mor_tab.test_mor_tab_record.new_test_col.fixed, expecting union
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:207) ~[classes/:?]
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:144) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:633) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:715) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:368) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:209) ~[classes/:?]
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:112) ~[classes/:?]
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:105) ~[classes/:?]
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:343) ~[classes/:?]
	at org.apache.hudi.LogFileIterator$.scanLog(LogFileIterator.scala:305) ~[classes/:?]
	at org.apache.hudi.LogFileIterator.<init>(LogFileIterator.scala:88) ~[classes/:?]
	at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96) ~[classes/:?]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.scheduler.Task.run(Task.scala:123) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_345]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_345]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_345]
20095 [Executor task launch worker for task 44] ERROR org.apache.spark.executor.Executor [] - Exception in task 0.0 in stage 35.0 (TID 44)
org.apache.hudi.exception.HoodieException: Exception when reading log file 
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:377) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:209) ~[classes/:?]
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:112) ~[classes/:?]
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:105) ~[classes/:?]
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:343) ~[classes/:?]
	at org.apache.hudi.LogFileIterator$.scanLog(LogFileIterator.scala:305) ~[classes/:?]
	at org.apache.hudi.LogFileIterator.<init>(LogFileIterator.scala:88) ~[classes/:?]
	at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96) ~[classes/:?]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.scheduler.Task.run(Task.scala:123) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) ~[spark-core_2.11-2.4.4.jar:2.4.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_345]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_345]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_345]
Caused by: org.apache.avro.AvroTypeException: Found hoodie.test_mor_tab.test_mor_tab_record.new_test_col.fixed, expecting union
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) ~[avro-1.8.2.jar:1.8.2]
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:207) ~[classes/:?]
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:144) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:633) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:715) ~[classes/:?]
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:368) ~[classes/:?]
	... 27 more
20111 [task-result-getter-0] WARN  org.apache.spark.scheduler.TaskSetManager [] - Lost task 0.0 in stage 35.0 (TID 44, localhost, executor driver): org.apache.hudi.exception.HoodieException: Exception when reading log file 
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:377)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:209)
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:112)
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:105)
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:343)
	at org.apache.hudi.LogFileIterator$.scanLog(LogFileIterator.scala:305)
	at org.apache.hudi.LogFileIterator.<init>(LogFileIterator.scala:88)
	at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.avro.AvroTypeException: Found hoodie.test_mor_tab.test_mor_tab_record.new_test_col.fixed, expecting union
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:207)
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:144)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:633)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:715)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:368)
	... 27 more

20113 [task-result-getter-0] ERROR org.apache.spark.scheduler.TaskSetManager [] - Task 0 in stage 35.0 failed 1 times; aborting job
20113 [task-result-getter-0] INFO  org.apache.spark.scheduler.TaskSchedulerImpl [] - Removed TaskSet 35.0, whose tasks have all completed, from pool 
20116 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.TaskSchedulerImpl [] - Cancelling stage 35
20117 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.TaskSchedulerImpl [] - Killing all running tasks in stage 35: Stage cancelled
20118 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler [] - ResultStage 35 (apply at OutcomeOf.scala:85) failed in 0.081 s due to Job aborted due to stage failure: Task 0 in stage 35.0 failed 1 times, most recent failure: Lost task 0.0 in stage 35.0 (TID 44, localhost, executor driver): org.apache.hudi.exception.HoodieException: Exception when reading log file 
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:377)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:220)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scan(AbstractHoodieLogRecordReader.java:209)
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:112)
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:105)
	at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:343)
	at org.apache.hudi.LogFileIterator$.scanLog(LogFileIterator.scala:305)
	at org.apache.hudi.LogFileIterator.<init>(LogFileIterator.scala:88)
	at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.avro.AvroTypeException: Found hoodie.test_mor_tab.test_mor_tab_record.new_test_col.fixed, expecting union
	at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
	at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
	at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:207)
	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock$RecordIterator.next(HoodieAvroDataBlock.java:144)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processDataBlock(AbstractHoodieLogRecordReader.java:633)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.processQueuedBlocksForInstant(AbstractHoodieLogRecordReader.java:715)
	at org.apache.hudi.common.table.log.AbstractHoodieLogRecordReader.scanInternal(AbstractHoodieLogRecordReader.java:368)
	... 27 more

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Reactions: 1
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

2 reactions
TengHuo commented, Nov 28, 2022

Hi @codope

I raised a new PR to fix this issue: https://github.com/apache/hudi/pull/7307. It is also based on Alexey’s fix: https://github.com/apache/hudi/pull/6358

Could you help review it? Really appreciated.

1 reaction
TengHuo commented, Nov 23, 2022

@voonhous and I did some troubleshooting on this issue, and we found it is caused by the difference between the writer schema and the reader schema at this line:

https://github.com/apache/hudi/blob/76a28daeb08e7192d75dfc447624c827643bef0d/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java#L171

Writer schema:

{
    "type": "record",
    "name": "test_mor_tab_record",
    "namespace": "hoodie.test_mor_tab",
    "fields": [
        {
            "name": "_hoodie_commit_time",
            "type": [
                "null",
                "string"
            ],
            "doc": "",
            "default": null
        },
        {
            "name": "_hoodie_commit_seqno",
            "type": [
                "null",
                "string"
            ],
            "doc": "",
            "default": null
        },
        {
            "name": "_hoodie_record_key",
            "type": [
                "null",
                "string"
            ],
            "doc": "",
            "default": null
        },
        {
            "name": "_hoodie_partition_path",
            "type": [
                "null",
                "string"
            ],
            "doc": "",
            "default": null
        },
        {
            "name": "_hoodie_file_name",
            "type": [
                "null",
                "string"
            ],
            "doc": "",
            "default": null
        },
        {
            "name": "id",
            "type": "int"
        },
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "price",
            "type": "double"
        },
        {
            "name": "ts",
            "type": "long"
        },
        {
            "name": "new_test_col",
            "type": {
                "type": "fixed",
                "name": "fixed",
                "namespace": "hoodie.test_mor_tab.test_mor_tab_record.new_test_col",
                "size": 11,
                "logicalType": "decimal",
                "precision": 25,
                "scale": 4
            },
            "doc": "a column for test decimal type"
        },
        {
            "name": "dt",
            "type": "string"
        }
    ]
}

Reader schema:

{
    "type": "record",
    "name": "Record",
    "fields": [
        {
            "name": "_hoodie_commit_time",
            "type": [
                "string",
                "null"
            ]
        },
        {
            "name": "_hoodie_commit_seqno",
            "type": [
                "string",
                "null"
            ]
        },
        {
            "name": "_hoodie_record_key",
            "type": [
                "string",
                "null"
            ]
        },
        {
            "name": "_hoodie_partition_path",
            "type": [
                "string",
                "null"
            ]
        },
        {
            "name": "_hoodie_file_name",
            "type": [
                "string",
                "null"
            ]
        },
        {
            "name": "id",
            "type": [
                "int",
                "null"
            ]
        },
        {
            "name": "name",
            "type": [
                "string",
                "null"
            ]
        },
        {
            "name": "price",
            "type": [
                "double",
                "null"
            ]
        },
        {
            "name": "ts",
            "type": [
                "long",
                "null"
            ]
        },
        {
            "name": "new_test_col",
            "type": [
                {
                    "type": "fixed",
                    "name": "fixed",
                    "namespace": "Record.new_test_col",
                    "size": 11,
                    "logicalType": "decimal",
                    "precision": 25,
                    "scale": 4
                },
                "null"
            ]
        },
        {
            "name": "dt",
            "type": [
                "string",
                "null"
            ]
        }
    ]
}

As can be seen in the writer schema, the type of column new_test_col is a fixed type, with the namespace hoodie.test_mor_tab.test_mor_tab_record.new_test_col.

{
    "name": "new_test_col",
    "type": {
        "type": "fixed",
        "name": "fixed",
        "namespace": "hoodie.test_mor_tab.test_mor_tab_record.new_test_col",
        "size": 11,
        "logicalType": "decimal",
        "precision": 25,
        "scale": 4
    },
    "doc": "a column for test decimal type"
}

But in the reader schema, the type of column new_test_col is a union type, with the namespace Record.new_test_col.

{
    "name": "new_test_col",
    "type": [
        {
            "type": "fixed",
            "name": "fixed",
            "namespace": "Record.new_test_col",
            "size": 11,
            "logicalType": "decimal",
            "precision": 25,
            "scale": 4
        },
        "null"
    ]
}

According to the Avro documentation, a reader’s UNION type is compatible in schema evolution with a writer’s non-union type, as long as the writer’s type matches one of the union’s branches. So, it is acceptable to read “fixed” type data with a union type.

However, the namespace in the reader schema is different from the one in the writer schema, which causes the exception mentioned above: org.apache.avro.AvroTypeException: Found hoodie.test_mor_tab.test_mor_tab_record.new_test_col.fixed, expecting union.
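For illustration, here is a minimal, standalone Avro sketch (not Hudi code; the record names, namespaces, and sizes are made up) of the same resolution failure: the writer schema stores the column as a bare fixed type under one namespace, while the reader schema wraps a fixed type from a different namespace in a union, so the resolving decoder cannot match any union branch.

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class NamespaceMismatchDemo {
  public static void main(String[] args) throws Exception {
    // Writer schema: the column is a plain (non-nullable) fixed type under the writer's namespace.
    Schema writerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"rec\",\"namespace\":\"ns.writer\",\"fields\":["
            + "{\"name\":\"col\",\"type\":{\"type\":\"fixed\",\"name\":\"fixed\","
            + "\"namespace\":\"ns.writer.rec.col\",\"size\":4}}]}");
    // Reader schema: the column is a union of a fixed type and null, but the fixed type lives
    // under a different namespace, so the full name ns.writer.rec.col.fixed matches no branch.
    Schema readerSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Record\",\"fields\":["
            + "{\"name\":\"col\",\"type\":[{\"type\":\"fixed\",\"name\":\"fixed\","
            + "\"namespace\":\"Record.col\",\"size\":4},\"null\"]}]}");

    // Serialize one record with the writer schema.
    GenericRecord record = new GenericData.Record(writerSchema);
    record.put("col", new GenericData.Fixed(
        writerSchema.getField("col").schema(), new byte[] {1, 2, 3, 4}));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writerSchema).write(record, encoder);
    encoder.flush();

    // Deserialize with the mismatched reader schema: this fails with an error like
    // org.apache.avro.AvroTypeException: Found ns.writer.rec.col.fixed, expecting union
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, decoder);
  }
}

In this sketch, renaming the reader’s fixed type (or its namespace) so that its full name matches the writer’s makes the read succeed, which is what the change below does for the test.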

If I replace the reader schema with one that uses the same namespace as the writer schema, the test case runs properly:

...
    private RecordIterator(Schema readerSchema, Schema writerSchema, byte[] content, InternalSchema internalSchema) throws IOException {
      this.content = content;

      this.dis = new SizeAwareDataInputStream(new DataInputStream(new ByteArrayInputStream(this.content)));

      // 1. Read version for this data block
      int version = this.dis.readInt();
      HoodieAvroDataBlockVersion logBlockVersion = new HoodieAvroDataBlockVersion(version);

      Schema finalReadSchema = readerSchema;
      if (!internalSchema.isEmptySchema()) {
        // we should use write schema to read log file,
        // since when we have done some DDL operation, the readerSchema maybe different from writeSchema, avro reader will throw exception.
        // eg: origin writeSchema is: "a String, b double" then we add a new column now the readerSchema will be: "a string, c int, b double". it's wrong to use readerSchema to read old log file.
        // after we read those record by writeSchema,  we rewrite those record with readerSchema in AbstractHoodieLogRecordReader
        finalReadSchema = writerSchema;
      }

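      // Workaround for this test only: instead of finalReadSchema, hardcode a read schema whose
      // fixed type keeps the writer schema's namespace (hoodie.test_mor_tab.test_mor_tab_record.new_test_col).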
      Schema readSchema = new Schema.Parser().parse("{\"type\":\"record\",\"name\":\"Record\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":[\"string\",\"null\"]},{\"name\":\"_hoodie_commit_seqno\",\"type\":[\"string\",\"null\"]},{\"name\":\"_hoodie_record_key\",\"type\":[\"string\",\"null\"]},{\"name\":\"_hoodie_partition_path\",\"type\":[\"string\",\"null\"]},{\"name\":\"_hoodie_file_name\",\"type\":[\"string\",\"null\"]},{\"name\":\"id\",\"type\":[\"int\",\"null\"]},{\"name\":\"name\",\"type\":[\"string\",\"null\"]},{\"name\":\"price\",\"type\":[\"double\",\"null\"]},{\"name\":\"ts\",\"type\":[\"long\",\"null\"]},{\"name\":\"new_test_col\",\"type\":[{\"type\":\"fixed\",\"name\":\"fixed\",\"namespace\":\"hoodie.test_mor_tab.test_mor_tab_record.new_test_col\",\"size\":11,\"logicalType\":\"decimal\",\"precision\":25,\"scale\":4},\"null\"]},{\"name\":\"dt\",\"type\":[\"string\",\"null\"]}]}");
      this.reader = new GenericDatumReader<>(writerSchema, readSchema);

      if (logBlockVersion.hasRecordCount()) {
        this.totalRecords = this.dis.readInt();
      }
    }
...
Read more comments on GitHub >

