
[SUPPORT] Hudi 0.10.1 Reconcile schema not working

See original GitHub issue

Describe the problem you faced

We are struggling to add partitions to a Hudi table when a column that exists in the schema of the existing table is missing in the new parquet files.

We processed the data on AWS via Glue and EMR and wrote it to S3.
We used Hudi 0.10.1 on EMR 6.6.0, and we also packaged the Hudi spark3.1.2-bundle ourselves and ran it on the cluster.

To Reproduce

We used the following Spark session configuration:

implicit val spark: SparkSession = SparkSession.builder
  .master(config.master)
  .appName("Hudi-Preprocessor")
  .config("spark.app.id", "Hudi-Preprocessor")
  .config("spark.sql.parquet.mergeSchema", value = true)
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .config("spark.sql.sources.partitionColumnTypeInference.enabled", value = false)
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .enableHiveSupport()
  .getOrCreate()

We read parquet files and want to rewrite them as a Hudi table in another directory.
We process the first data set and create the initial Hudi table with it.
Then we update the table with the second data set, which is missing the column "price".
Usually Spark just adds null values for such columns, but in this case we get the error message shown in the stacktrace below.
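
The Record and Record2 case classes are not defined in the issue. A minimal sketch consistent with how they are used in the reproduction below (field names and types are assumptions) could look like this:

// Assumed definitions; not part of the original issue.
case class Record(
  offer_key: String,
  price: Option[Int],
  action: String,
  event_time: java.sql.Timestamp,
  year: String,
  month: String,
  day: String,
  hour: String
)

// Record2 deliberately has no "price" field.
case class Record2(
  offer_key: String,
  action: String,
  event_time: java.sql.Timestamp
)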

// Create parquet example data that will be processed.
Seq(
  Record("ok1", Some(100), "update", Timestamp.valueOf("2022-01-01 00:00:00"), "2022", "01", "01", "00"),
  Record("ok2", Some(200), "update", Timestamp.valueOf("2022-01-01 00:00:00"), "2022", "01", "01", "00"),
  Record("ok3", Some(300), "update", Timestamp.valueOf("2022-01-01 00:00:00"), "2022", "01", "01", "00"),
  Record("ok3", Some(300), "delete", Timestamp.valueOf("2022-01-01 00:00:01"), "2022", "01", "01", "00"),
  Record("ok4", Some(400), "update", Timestamp.valueOf("2022-01-01 00:00:00"), "2022", "01", "01", "00"),
  Record("ok4", Some(500), "update", Timestamp.valueOf("2022-01-01 00:10:00"), "2022", "01", "01", "00")
).toDF("offer_key", "price", "action", "event_time", "year", "month", "day", "hour")
  .write
  .partitionBy("year", "month", "day", "hour")
  .saveAsTable("offers")

// Record2 is missing the "price" field, so a parquet file without this column is written.
Seq(
  Record2(offer_key = "ok4", action = "update", event_time = Timestamp.valueOf("2022-01-01 01:10:00"))
).toDF()
  .write
  .parquet(s"$rootPath/spark-warehouse/offers/year=2022/month=01/day=01/hour=01")
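
For comparison (an illustrative sketch added here, not part of the original reproduction), reading both data sets back with plain Spark shows the behaviour we also expect from the Hudi write: the missing "price" column is simply filled with nulls.

// Plain Spark merges the two parquet schemas and fills the missing column with nulls.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet(s"$rootPath/spark-warehouse/offers")

merged.select("offer_key", "price", "action").show()
// The row written from Record2 appears with price = null.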


val hudiOptions = Map[String, String](
  HoodieWriteConfig.TBL_NAME.key() -> "my_table",
  DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "offer_key",
  DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "year,month,day",
  DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "event_time",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED.key() -> "true", // extracted to method for mocking in local tests
  DataSourceWriteOptions.HIVE_STYLE_PARTITIONING.key() -> "true",
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS.key() -> "year,month,day",
  DataSourceWriteOptions.HIVE_DATABASE.key() -> "my_table",
  DataSourceWriteOptions.HIVE_TABLE.key() -> "my_table",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS.key() -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  DataSourceWriteOptions.HIVE_SYNC_MODE.key() -> "hms",
  DataSourceWriteOptions.HIVE_SUPPORT_TIMESTAMP_TYPE.key() -> "true",
  HoodieIndexConfig.INDEX_TYPE.key() -> "GLOBAL_SIMPLE",
  DataSourceWriteOptions.RECONCILE_SCHEMA.key() -> "true"
)

inputDf.write
  .format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .options(hudiOptions)
  .mode(SaveMode.Append)
  .save("<path-to-table>/my_table")

Expected behavior

All records are written into the Hudi table, and the column missing from the second write operation (the update) is defaulted to null.

Environment Description

  • Hudi version : 0.10.1

  • Spark version : 3.1.2

  • Hive version : 3.1.2

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3 (for production), local for reproducing

  • Running on Docker? (yes/no) : no

Additional context

With Hudi 0.11.0 this problem is solved; unfortunately, we couldn't upgrade to that version because of #5715.
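
Until the upgrade is possible, one possible interim workaround (not from the original issue; a hedged sketch, and alignToTableSchema is a hypothetical helper) is to align the incoming DataFrame with the existing table schema before the upsert by adding the missing columns as typed null literals:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StructType

// Add any columns that exist in the current table schema but are missing from the
// incoming batch as typed null literals, so the upsert carries the full schema.
def alignToTableSchema(df: DataFrame, tableSchema: StructType): DataFrame =
  tableSchema.fields
    .filterNot(_.name.startsWith("_hoodie")) // skip Hudi metadata columns
    .foldLeft(df) { (acc, field) =>
      if (acc.columns.contains(field.name)) acc
      else acc.withColumn(field.name, lit(null).cast(field.dataType))
    }

// Hypothetical usage with the paths from the reproduction above:
val tableSchema = spark.read.format("org.apache.hudi").load("<path-to-table>/my_table").schema
val alignedDf   = alignToTableSchema(inputDf, tableSchema)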

Stacktrace

scala.MatchError: false (of class java.lang.Boolean)
at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$7(AvroConversionHelper.scala:308)
at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$14(AvroConversionHelper.scala:373)
at org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:157)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
22/05/30 14:19:41 WARN TaskSetManager: Lost task 0.0 in stage 34.0 (TID 1333) (w0duh.fritz.box executor driver): scala.MatchError: false (of class java.lang.Boolean)
at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$7(AvroConversionHelper.scala:308)
at org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$14(AvroConversionHelper.scala:373)
at org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:157)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

2 reactions
mgerlach commented, Jun 20, 2022

@nsivabalan @xushiyan - @jdiebold is out of office for a while, will try to get to #5715 (with @rmehlitz) first, asap (currently a bit short of time)

0 reactions
nsivabalan commented, Nov 7, 2022

I see. If you would like to understand any details of Hudi, or need help with any PoC or onboarding, let us know. CC @bhasudha

Read more comments on GitHub >

Top Results From Across the Web

All Configurations | Apache Hudi
This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at...

Troubleshooting - Apache Hudi
This error generally occurs when the schema has evolved in backwards incompatible way by deleting some column 'col1' and we are trying to...

Configurations - Apache Hudi
This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at...

FAQs - Apache Hudi
As of September 2019, Hudi can support Spark 2.1+, Hive 2.x, Hadoop 2.7+ (not Hadoop 3). How does Hudi actually store data inside...

Release 0.10.1 - Apache Hudi
This release (0.10.1) does not introduce any new table version, ... In the previous release (0.10.0), we added Spark 3.1.x support and made...
