[SUPPORT] Hive Sync to Glue throws Failed to read data schema

See original GitHub issue

To Reproduce

I am getting an exception from Hive sync to the Glue Catalog when writing an empty dataframe (with a schema) to S3.

Steps to reproduce the behavior:

  1. Create an empty dataframe with a schema, for example:

     val schema: StructType = ???
     val df = session.createDataFrame(session.sparkContext.emptyRDD[Row], schema)

  2. Write this dataframe to S3 with Hive sync enabled, for example:

     df.write.format("hudi")
       .options(options)
       .mode(SaveMode.Overwrite)
       .save("s3://my-datalake/my-table...")

My Hudi writer options:

  val initLoadConfig = Map(
    BULKINSERT_PARALLELISM -> "4",
    INSERT_PARALLELISM -> "4",
    UPSERT_PARALLELISM -> "4",
    DELETE_PARALLELISM -> "4"
  )
  val unpartitionDataConfig = Map(
    HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> "org.apache.hudi.hive.NonPartitionedExtractor",
    KEYGENERATOR_CLASS_PROP -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
  )

  def writerOptions(
      table: String,
      primaryKey: String,
      database: String,
  ) = {
    Map(
      OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
      PRECOMBINE_FIELD_PROP -> "some field here",
      RECORDKEY_FIELD_OPT_KEY -> primaryKey,
      TABLE_NAME -> table,
      "hoodie.consistency.check.enabled" -> "true",
      ENABLE_ROW_WRITER_OPT_KEY -> "true",
      HIVE_USE_JDBC_OPT_KEY -> "true",
      HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      HIVE_SUPPORT_TIMESTAMP -> "true",
      HIVE_DATABASE_OPT_KEY -> database,
      HIVE_TABLE_OPT_KEY -> table
    ) ++ initLoadConfig ++ unpartitionDataConfig
  }
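
For completeness, here is a sketch of how the `options` value used in the write call above would be assembled from these maps (the table, key, and database names are placeholders, not from the original report):

    // Hypothetical invocation of the helper above; names are placeholders.
    val options = writerOptions(
      table = "my_table",
      primaryKey = "id",
      database = "my_database"
    )

    df.write
      .format("hudi")
      .options(options)
      .mode(SaveMode.Overwrite)
      .save("s3://my-datalake/my-table")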

Expected behavior

The Hudi table is registered in the AWS Glue Catalog as an external table.

Environment Description

  • Hudi version : 0.7

  • Spark version : 3.1.1

  • Hive version : AWS Glue Catalog

  • Hadoop version : EMR 6.3.0

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional Context

Inside basePath/.hoodie/ I have the following:

                           PRE .aux/
                           PRE .temp/
                           PRE archived/
2021-09-07 11:48:45        407 20210907094837.commit
2021-09-07 11:48:42          0 20210907094837.commit.requested
2021-09-07 11:48:43          0 20210907094837.inflight
2021-09-07 11:48:40        234 hoodie.properties

Commit file:

cat 20210907094837.commit 
{
  "partitionToWriteStats" : { },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : null
  },
  "operationType" : "BULK_INSERT",
  "fileIdAndRelativePaths" : { },
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 0,
  "totalUpsertTime" : 0
}
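
Note that extraMetadata.schema is null, which is what the Hive sync step later trips over. As a quick sanity check, the latest commit file can be read directly before relying on Hive sync. The snippet below is only a rough sketch, not a Hudi API: basePath is a placeholder and the string match assumes the exact JSON formatting shown above.

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    val basePath = "s3://my-datalake/my-table"  // placeholder
    val fs = FileSystem.get(new URI(basePath), session.sparkContext.hadoopConfiguration)

    // Pick the newest completed commit under .hoodie (instants sort by timestamp).
    val latestCommit = fs
      .listStatus(new Path(s"$basePath/.hoodie"))
      .map(_.getPath)
      .filter(_.getName.endsWith(".commit"))
      .sortBy(_.getName)
      .lastOption

    latestCommit.foreach { p =>
      val in = fs.open(p)
      val json = try Source.fromInputStream(in).mkString finally in.close()
      // Naive check that mirrors the output above; a JSON parser would be more robust.
      println(s"${p.getName}: schema present = ${!json.contains("\"schema\" : null")}")
    }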

Stacktrace

21/09/07 09:48:47 WARN HiveSyncTool: Set partitionFields to empty, since the NonPartitionedExtractor is used
21/09/07 09:48:47 ERROR HiveSyncTool: Got runtime exception when hive syncing
org.apache.hudi.sync.common.HoodieSyncException: Failed to read data schema
	at org.apache.hudi.sync.common.AbstractSyncHoodieClient.getDataSchema(AbstractSyncHoodieClient.java:121)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:134)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:355)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4(HoodieSparkSqlWriter.scala:403)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4$adapted(HoodieSparkSqlWriter.scala:399)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:399)
	at org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:311)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:127)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
novakov-alexey-zz commented, Sep 27, 2021

@xushiyan thank you, it is indeed fixed in Hudi 0.9.

If someone needs an example of building an EMR job with a custom Hudi build in SBT, there is one here (a minimal dependency sketch follows below): https://github.com/novakov-alexey/spark-elt-jobs/blob/main/build.sbt#L207
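
For reference, the key dependency line would look roughly like this, assuming the Spark 3 bundle and Scala 2.12 (see the linked build.sbt for the full assembly and packaging setup):

    // build.sbt (sketch): pin Hudi 0.9.0 instead of the EMR-provided jars
    libraryDependencies += "org.apache.hudi" % "hudi-spark3-bundle_2.12" % "0.9.0"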

0 reactions
xushiyan commented, Sep 27, 2021

@novakov-alexey I checked, and the behavior is fixed in 0.9.0. Please give release-0.9.0 a try. You can find a guide on how to override the Hudi jars on EMR here: https://hudi.apache.org/learn/faq#how-to-override-hudi-jars-in-emr

I used this snippet to reproduce your scenario with 0.9.0. After the bulk insert commit, I can see the schema value is populated in the commit file, which should allow the sync with the Glue Catalog to succeed.

    // Imports assumed for this snippet (not part of the original comment):
    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.hudi.hive.NonPartitionedExtractor
    import org.apache.hudi.keygen.NonpartitionedKeyGenerator
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.SaveMode.Overwrite
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val opts = Map(
      DataSourceWriteOptions.TABLE_NAME.key() -> "language",
      DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
      DataSourceWriteOptions.OPERATION.key() -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
      DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "lang",
      DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "score",
      DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key() -> classOf[NonpartitionedKeyGenerator].getCanonicalName,
      DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS.key() -> classOf[NonPartitionedExtractor].getCanonicalName,
      DataSourceWriteOptions.HIVE_STYLE_PARTITIONING.key() -> "true",
      HoodieWriteConfig.TBL_NAME.key() -> "language",
      HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key() -> "1",
      HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key() -> "1",
      HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key() -> "1",
      HoodieWriteConfig.FINALIZE_WRITE_PARALLELISM_VALUE.key() -> "1",
      "spark.default.parallelism" -> "1",
      "spark.sql.shuffle.partitions" -> "1"
    )

    val simpleSchema = StructType(Array(
      StructField("lang", StringType, nullable = false),
      StructField("score", IntegerType, nullable = false)
    ))
    val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], simpleSchema)
    emptyDF.write.format("hudi").options(opts).mode(Overwrite).save(basePath)
    val data = spark.read.format("hudi").load(basePath)
    data.show()

Resulting commit file:

➜ cat .hoodie/20210926212533.commit
{
  "partitionToWriteStats" : { },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"language_record\",\"namespace\":\"hoodie.language\",\"fields\":[{\"name\":\"lang\",\"type\":\"string\"},{\"name\":\"score\",\"type\":\"int\"}]}"
  },
  "operationType" : "BULK_INSERT",
  "fileIdAndRelativePaths" : { },
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 0,
  "totalUpsertTime" : 0,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ ]
}
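
As a hedged follow-up check (the snippet above does not set a Hive database, so the default database is assumed here): once the sync succeeds, the table should be queryable through the Glue-backed catalog, for example:

    // Illustrative only: confirm the synced table shows up in the catalog.
    spark.sql("SHOW TABLES IN default").show(truncate = false)
    spark.sql("DESCRIBE FORMATTED default.language").show(100, truncate = false)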