[SUPPORT] Hive Sync to Glue throws Failed to read data schema
See original GitHub issue

To Reproduce
I am getting an exception during Hive sync to the Glue Catalog when writing an empty dataframe (with a schema) to S3.
Steps to reproduce the behavior:
- Create an empty dataframe with a schema, for example:
val schema: StructType = ???
val df = session.createDataFrame(session.sparkContext.emptyRDD[Row], schema)
- Write this dataframe to S3 with Hive sync enabled. For example:
df.write.format("hudi")
.options(options)
.mode(SaveMode.Overwrite)
.save("s3://my-datalake/my-table...")
My Hudi writer options:
val initLoadConfig = Map(
BULKINSERT_PARALLELISM -> "4",
INSERT_PARALLELISM -> "4",
UPSERT_PARALLELISM -> "4",
DELETE_PARALLELISM -> "4"
)
val unpartitionDataConfig = Map(
HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> "org.apache.hudi.hive.NonPartitionedExtractor",
KEYGENERATOR_CLASS_PROP -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
)
def writerOptions(
table: String,
primaryKey: String,
database: String,
) = {
Map(
OPERATION_OPT_KEY -> BULK_INSERT_OPERATION_OPT_VAL,
PRECOMBINE_FIELD_PROP -> "some field here",
RECORDKEY_FIELD_OPT_KEY -> primaryKey,
TABLE_NAME -> table,
"hoodie.consistency.check.enabled" -> "true",
ENABLE_ROW_WRITER_OPT_KEY -> "true",
HIVE_USE_JDBC_OPT_KEY -> "true",
HIVE_SYNC_ENABLED_OPT_KEY -> "true",
HIVE_SUPPORT_TIMESTAMP -> "true",
HIVE_DATABASE_OPT_KEY -> database,
HIVE_TABLE_OPT_KEY -> table
) ++ initLoadConfig ++ unpartitionDataConfig
}
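For reference, here is approximately what that map resolves to with raw config key strings in place of the `DataSourceWriteOptions` constants. This is a sketch based on the Hudi 0.7 key names and is worth double-checking against your Hudi version; the `"some field here"` precombine value is carried over from the original snippet as a placeholder:

```scala
object HudiWriterConfig {
  // Raw Hudi config keys (assumed to correspond to the 0.7 constants used above).
  def writerOptions(table: String, primaryKey: String, database: String): Map[String, String] =
    Map(
      "hoodie.datasource.write.operation"         -> "bulk_insert",
      "hoodie.datasource.write.precombine.field"  -> "some field here",
      "hoodie.datasource.write.recordkey.field"   -> primaryKey,
      "hoodie.table.name"                         -> table,
      "hoodie.consistency.check.enabled"          -> "true",
      "hoodie.datasource.write.row.writer.enable" -> "true",
      "hoodie.datasource.hive_sync.use_jdbc"      -> "true",
      "hoodie.datasource.hive_sync.enable"        -> "true",
      "hoodie.datasource.hive_sync.support_timestamp" -> "true",
      "hoodie.datasource.hive_sync.database"      -> database,
      "hoodie.datasource.hive_sync.table"         -> table,
      // initLoadConfig: shuffle parallelism for each write operation
      "hoodie.bulkinsert.shuffle.parallelism"     -> "4",
      "hoodie.insert.shuffle.parallelism"         -> "4",
      "hoodie.upsert.shuffle.parallelism"         -> "4",
      "hoodie.delete.shuffle.parallelism"         -> "4",
      // unpartitionDataConfig: non-partitioned table setup
      "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.NonPartitionedExtractor",
      "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator"
    )
}
```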
Expected behavior
Hudi table is registered in AWS Glue Catalog as external table.
Environment Description
- Hudi version : 0.7
- Spark version : 3.1.1
- Hive version : AWS Glue Catalog
- Hadoop version : EMR 6.3.0
- Storage (HDFS/S3/GCS…) : S3
- Running on Docker? (yes/no) : no
Additional Context
Inside basepath/.hoodie/ I have the following:
PRE .aux/
PRE .temp/
PRE archived/
2021-09-07 11:48:45 407 20210907094837.commit
2021-09-07 11:48:42 0 20210907094837.commit.requested
2021-09-07 11:48:43 0 20210907094837.inflight
2021-09-07 11:48:40 234 hoodie.properties
Commit file:
cat 20210907094837.commit
{
"partitionToWriteStats" : { },
"compacted" : false,
"extraMetadata" : {
"schema" : null
},
"operationType" : "BULK_INSERT",
"fileIdAndRelativePaths" : { },
"totalRecordsDeleted" : 0,
"totalLogRecordsCompacted" : 0,
"totalLogFilesCompacted" : 0,
"totalCompactedRecordsUpdated" : 0,
"totalLogFilesSize" : 0,
"totalScanTime" : 0,
"totalCreateTime" : 0,
"totalUpsertTime" : 0
}
Stacktrace
21/09/07 09:48:47 WARN HiveSyncTool: Set partitionFields to empty, since the NonPartitionedExtractor is used
21/09/07 09:48:47 ERROR HiveSyncTool: Got runtime exception when hive syncing
org.apache.hudi.sync.common.HoodieSyncException: Failed to read data schema
at org.apache.hudi.sync.common.AbstractSyncHoodieClient.getDataSchema(AbstractSyncHoodieClient.java:121)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:134)
at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:355)
at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4(HoodieSparkSqlWriter.scala:403)
at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$4$adapted(HoodieSparkSqlWriter.scala:399)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:399)
at org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:311)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:127)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
Issue Analytics
- Created 2 years ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
@xushiyan thank you, it is indeed fixed in Hudi 0.9.
If someone needs an example of building an EMR job with its own Hudi jar in SBT, there is one here: https://github.com/novakov-alexey/spark-elt-jobs/blob/main/build.sbt#L207
@novakov-alexey I checked that this behavior is fixed in 0.9.0. Please give release-0.9.0 a try. You can find a guide on overriding the EMR Hudi jars here: https://hudi.apache.org/learn/faq#how-to-override-hudi-jars-in-emr
I used this snippet to reproduce your scenario with 0.9.0. After the bulk insert commit, I can see that the schema value is populated in the commit file. This should allow you to sync with the Glue Catalog.
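Following the FAQ link above, one common approach to overriding the EMR-bundled Hudi jars is to put your own hudi-spark-bundle first on the classpath at submit time. A sketch only; the jar paths, bundle version, and job class below are illustrative placeholders, not from this thread:

```shell
# Submit with a custom Hudi 0.9.0 bundle instead of the EMR-provided jar.
# All paths and class names are placeholders - adjust for your cluster.
spark-submit \
  --jars s3://my-artifacts/hudi-spark3-bundle_2.12-0.9.0.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --class com.example.MyHudiJob \
  s3://my-artifacts/my-job.jar
```

The `userClassPathFirst` settings make Spark prefer the user-supplied jar over the cluster's bundled one; consult the linked FAQ for the EMR-specific steps.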