[SUPPORT] Hive Sync to Glue throws Failed to read data schema

See original GitHub issue

To Reproduce

I am getting an exception from Hive sync to the Glue Catalog when writing an empty dataframe (with a schema) to S3.

Steps to reproduce the behavior:

  1. Create an empty dataframe with a schema, for example:

     val schema: StructType = ???
     val df = session.createDataFrame(session.sparkContext.emptyRDD[Row], schema)

  2. Write this dataframe to S3 with Hive sync enabled, for example:

     df.write.format("hudi")
       .options(options)
       .mode(SaveMode.Overwrite)
       .save("s3://my-datalake/my-table...")

My Hudi writer options:

  val initLoadConfig = Map(
    BULKINSERT_PARALLELISM -> "4",
    INSERT_PARALLELISM -> "4",
    UPSERT_PARALLELISM -> "4",
    DELETE_PARALLELISM -> "4"
  )
  val unpartitionDataConfig = Map(
    HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> "org.apache.hudi.hive.NonPartitionedExtractor",
    KEYGENERATOR_CLASS_PROP -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
  )

  def writerOptions(
      table: String,
      primaryKey: String,
      database: String,
  ) = {
    Map(
      OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
      PRECOMBINE_FIELD_PROP -> "some field here",
      RECORDKEY_FIELD_OPT_KEY -> primaryKey,
      TABLE_NAME -> table,
      "hoodie.consistency.check.enabled" -> "true",
      ENABLE_ROW_WRITER_OPT_KEY -> "true",
      HIVE_USE_JDBC_OPT_KEY -> "true",
      HIVE_SYNC_ENABLED_OPT_KEY -> "true",
      HIVE_SUPPORT_TIMESTAMP -> "true",
      HIVE_DATABASE_OPT_KEY -> database,
      HIVE_TABLE_OPT_KEY -> table
    ) ++ initLoadConfig ++ unpartitionDataConfig
  }
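
For completeness, here is a sketch of how the `options` value used in the write call above would be assembled from these maps (the table, key, and database names are placeholders, not from the original report):

    // Hypothetical invocation of the helper above; names are placeholders.
    val options = writerOptions(
      table = "my_table",
      primaryKey = "id",
      database = "my_database"
    )

    df.write
      .format("hudi")
      .options(options)
      .mode(SaveMode.Overwrite)
      .save("s3://my-datalake/my-table")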

Expected behavior

The Hudi table is registered in the AWS Glue Catalog as an external table.

Environment Description

  • Hudi version : 0.7

  • Spark version : 3.1.1

  • Hive version : AWS Glue Catalog

  • Hadoop version : EMR 6.3.0

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional Context

Inside basePath/.hoodie/ I have the following:

                           PRE .aux/
                           PRE .temp/
                           PRE archived/
2021-09-07 11:48:45        407 20210907094837.commit
2021-09-07 11:48:42          0 20210907094837.commit.requested
2021-09-07 11:48:43          0 20210907094837.inflight
2021-09-07 11:48:40        234 hoodie.properties

Commit file:

cat 20210907094837.commit 
{
  "partitionToWriteStats" : { },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : null
  },
  "operationType" : "BULK_INSERT",
  "fileIdAndRelativePaths" : { },
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 0,
  "totalUpsertTime" : 0
}
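
Note that extraMetadata.schema is null, which is what the Hive sync step later trips over. As a quick sanity check, the latest commit file can be read directly before relying on Hive sync. The snippet below is only a rough sketch, not a Hudi API: basePath is a placeholder and the string match assumes the exact JSON formatting shown above.

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    val basePath = "s3://my-datalake/my-table"  // placeholder
    val fs = FileSystem.get(new URI(basePath), session.sparkContext.hadoopConfiguration)

    // Pick the newest completed commit under .hoodie (instants sort by timestamp).
    val latestCommit = fs
      .listStatus(new Path(s"$basePath/.hoodie"))
      .map(_.getPath)
      .filter(_.getName.endsWith(".commit"))
      .sortBy(_.getName)
      .lastOption

    latestCommit.foreach { p =>
      val in = fs.open(p)
      val json = try Source.fromInputStream(in).mkString finally in.close()
      // Naive check that mirrors the output above; a JSON parser would be more robust.
      println(s"${p.getName}: schema present = ${!json.contains("\"schema\" : null")}")
    }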

Stacktrace

21/09/07 09:48:47 WARN HiveSyncTool: Set partitionFields to empty, since the NonPartitionedExtractor is used
21/09/07 09:48:47 ERROR HiveSyncTool: Got runtime exception when hive syncing
org.apache.hudi.sync.common.HoodieSyncException: Failed to read data schema
	at org.apache.hudi.sync.common.AbstractSyncHoodieClient.getDataSchema(AbstractSyncHoodieClient.java:121)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:134)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:355)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4(HoodieSparkSqlWriter.scala:403)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4$adapted(HoodieSparkSqlWriter.scala:399)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:399)
	at org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:311)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:127)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
novakov-alexey-zz commented, Sep 27, 2021

@xushiyan thank you, it is indeed fixed in Hudi 0.9.

If someone needs an example of building an EMR job with a custom Hudi build in SBT, there is one here (a minimal dependency sketch follows below): https://github.com/novakov-alexey/spark-elt-jobs/blob/main/build.sbt#L207
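
For reference, the key dependency line would look roughly like this, assuming the Spark 3 bundle and Scala 2.12 (see the linked build.sbt for the full assembly and packaging setup):

    // build.sbt (sketch): pin Hudi 0.9.0 instead of the EMR-provided jars
    libraryDependencies += "org.apache.hudi" % "hudi-spark3-bundle_2.12" % "0.9.0"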

0 reactions
xushiyan commented, Sep 27, 2021

@novakov-alexey I checked, and the behavior is fixed in 0.9.0. Please give release-0.9.0 a try. You can find a guide on how to override the Hudi jars on EMR here: https://hudi.apache.org/learn/faq#how-to-override-hudi-jars-in-emr

I used this snippet to reproduce your scenario with 0.9.0. After the bulk insert commit, I can see the schema value is populated in the commit file, which should allow the sync with the Glue Catalog to succeed.

    // Imports assumed for this snippet (not part of the original comment):
    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.hudi.hive.NonPartitionedExtractor
    import org.apache.hudi.keygen.NonpartitionedKeyGenerator
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.SaveMode.Overwrite
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val opts = Map(
      DataSourceWriteOptions.TABLE_NAME.key() -> "language",
      DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
      DataSourceWriteOptions.OPERATION.key() -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
      DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "lang",
      DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "score",
      DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key() -> classOf[NonpartitionedKeyGenerator].getCanonicalName,
      DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS.key() -> classOf[NonPartitionedExtractor].getCanonicalName,
      DataSourceWriteOptions.HIVE_STYLE_PARTITIONING.key() -> "true",
      HoodieWriteConfig.TBL_NAME.key() -> "language",
      HoodieWriteConfig.INSERT_PARALLELISM_VALUE.key() -> "1",
      HoodieWriteConfig.UPSERT_PARALLELISM_VALUE.key() -> "1",
      HoodieWriteConfig.BULKINSERT_PARALLELISM_VALUE.key() -> "1",
      HoodieWriteConfig.FINALIZE_WRITE_PARALLELISM_VALUE.key() -> "1",
      "spark.default.parallelism" -> "1",
      "spark.sql.shuffle.partitions" -> "1"
    )

    val simpleSchema = StructType(Array(
      StructField("lang", StringType, nullable = false),
      StructField("score", IntegerType, nullable = false)
    ))
    val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], simpleSchema)
    emptyDF.write.format("hudi").options(opts).mode(Overwrite).save(basePath)
    val data = spark.read.format("hudi").load(basePath)
    data.show()

Resulting commit file:

➜ cat .hoodie/20210926212533.commit
{
  "partitionToWriteStats" : { },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"language_record\",\"namespace\":\"hoodie.language\",\"fields\":[{\"name\":\"lang\",\"type\":\"string\"},{\"name\":\"score\",\"type\":\"int\"}]}"
  },
  "operationType" : "BULK_INSERT",
  "fileIdAndRelativePaths" : { },
  "totalRecordsDeleted" : 0,
  "totalLogRecordsCompacted" : 0,
  "totalLogFilesCompacted" : 0,
  "totalCompactedRecordsUpdated" : 0,
  "totalLogFilesSize" : 0,
  "totalScanTime" : 0,
  "totalCreateTime" : 0,
  "totalUpsertTime" : 0,
  "minAndMaxEventTime" : {
    "Optional.empty" : {
      "val" : null,
      "present" : false
    }
  },
  "writePartitionPaths" : [ ]
}
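
As a hedged follow-up check (the snippet above does not set a Hive database, so the default database is assumed here): once the sync succeeds, the table should be queryable through the Glue-backed catalog, for example:

    // Illustrative only: confirm the synced table shows up in the catalog.
    spark.sql("SHOW TABLES IN default").show(truncate = false)
    spark.sql("DESCRIBE FORMATTED default.language").show(100, truncate = false)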