
[SUPPORT] AWS Glue 3.0 fail to write dataset with hudi (hive sync issue)

See original GitHub issue

Describe the problem you faced

I was trying to use Hudi with AWS Glue.

First, I create a simple DataFrame:

from pyspark.sql import Row
import time

ut = time.time()

product = [
    {'product_id': '00001', 'product_name': 'Heater', 'price': 250, 'category': 'Electronics', 'updated_at': ut},
    {'product_id': '00002', 'product_name': 'Thermostat', 'price': 400, 'category': 'Electronics', 'updated_at': ut},
    {'product_id': '00003', 'product_name': 'Television', 'price': 600, 'category': 'Electronics', 'updated_at': ut},
    {'product_id': '00004', 'product_name': 'Blender', 'price': 100, 'category': 'Electronics', 'updated_at': ut},
    {'product_id': '00005', 'product_name': 'USB chargers', 'price': 50, 'category': 'Electronics-t2', 'updated_at': ut}
]

df_products = spark.createDataFrame(Row(**x) for x in product)

Then I set the Hudi config:

hudi_options = {
    'hoodie.table.name': 'Customer_Sample_Hudi',
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'product_id',
    'hoodie.datasource.write.partitionpath.field': 'product_id',
    'hoodie.datasource.write.table.name': 'Customer_Sample_Hudi',
    'hoodie.datasource.write.operation': 'insert_overwrite',
    'hoodie.datasource.write.precombine.field': 'updated_at',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'path': 's3://my_staging_bucket/Customer_Sample_Hudi/',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.database': 'default',
    'hoodie.datasource.hive_sync.table': 'Customer_Sample_Hudi',
    'hoodie.datasource.hive_sync.partition_fields': 'product_id',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.mode': 'hms'
}

Then I write the DataFrame to AWS S3:

df_products.write.format("hudi")  \
    .options(**hudi_options)  \
    .mode("append")  \
    .save()

Sometimes it writes the DataFrame successfully.

Sometimes it fails (with the same DataFrame but a different table name).

Please see the error message below.

The data is written to AWS S3, but the job still hits an error.

I've checked that the table name did not exist in the AWS Glue Data Catalog before writing, and the path in AWS S3 did not exist either.
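
For reference, this is roughly how that check can be done (a minimal sketch using boto3; the client setup is an assumption, not part of the original job script):

import boto3

glue = boto3.client('glue')  # assumes the Glue job's own credentials and region

def table_exists(database, table):
    # True if the table is already registered in the Glue Data Catalog
    try:
        glue.get_table(DatabaseName=database, Name=table)
        return True
    except glue.exceptions.EntityNotFoundException:
        return False

print(table_exists('default', 'Customer_Sample_Hudi'))  # was False before the failing write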

To Reproduce

Steps to reproduce the behavior:

Steps are above

Expected behavior

To write the dataset with Hudi successfully.

Environment Description

  • Hudi version : 0.10.1

  • Spark version : AWS Glue 3.0 with Spark 3.1.1-amzn-0

  • Hive version : glue catalog

  • Hadoop version : 3.2.1-amzn-3

  • Storage (HDFS/S3/GCS…) : AWS S3

  • Running on Docker? (yes/no) : no

Additional context

I was following this guideline: refer

using these 3 jars:

  • hudi-utilities-bundle_2.12-0.10.1.jar
  • hudi-spark3.1.1-bundle_2.12-0.10.1.jar
  • spark-avro_2.12-3.1.1.jar
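
For context, these jars are typically attached to the Glue 3.0 job through its job arguments; below is a minimal boto3 sketch (the S3 jar paths, job name, role, and script location are hypothetical placeholders, not taken from the original setup):

import boto3

glue = boto3.client('glue')

jars = ','.join([
    's3://my_jars_bucket/hudi-utilities-bundle_2.12-0.10.1.jar',
    's3://my_jars_bucket/hudi-spark3.1.1-bundle_2.12-0.10.1.jar',
    's3://my_jars_bucket/spark-avro_2.12-3.1.1.jar',
])

glue.create_job(
    Name='hudi_sample_job',                   # hypothetical job name
    Role='MyGlueServiceRole',                 # hypothetical IAM role
    GlueVersion='3.0',
    Command={'Name': 'glueetl',
             'ScriptLocation': 's3://my_scripts_bucket/hudi_sample_job.py',
             'PythonVersion': '3'},
    DefaultArguments={
        '--extra-jars': jars,                 # make the Hudi and Avro jars available to Spark
        '--conf': 'spark.serializer=org.apache.spark.serializer.KryoSerializer',
    },
)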

Stacktrace

Py4JJavaError: An error occurred while calling o235.save.
: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing Customer_Sample_Hudi
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:118)
	at org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:539)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:595)
	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2$adapted(HoodieSparkSqlWriter.scala:591)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:77)
	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:591)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:665)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:286)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table Customer_Sample_Hudi
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:363)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:197)
	at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:129)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:115)
	... 46 more
Caused by: java.lang.IllegalArgumentException: Partitions must be in the same table
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.validateInputForBatchCreatePartitions(GlueMetastoreClientDelegate.java:800)
	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.batchCreatePartitions(GlueMetastoreClientDelegate.java:736)
	at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.addPartitions(GlueMetastoreClientDelegate.java:718)
	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.add_partitions(AWSCatalogMetastoreClient.java:339)
	at org.apache.hudi.hive.ddl.HMSDDLExecutor.addPartitionsToTable(HMSDDLExecutor.java:198)
	at org.apache.hudi.hive.HoodieHiveClient.addPartitionsToTable(HoodieHiveClient.java:115)
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:346)
	... 49 more
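
Based on the comments below, the likely cause is a table-name case mismatch (an inference, not a verified reading of the Glue client code): the Glue Data Catalog stores table names in lowercase, while hive sync passes the mixed-case name from the config, so the batch partition creation appears to target a different table.

# Illustrative only; a toy comparison of the two names involved, not actual Glue client logic.
configured_name = 'Customer_Sample_Hudi'   # name used in hudi_options
catalog_name = configured_name.lower()     # how the Glue Data Catalog stores the table name

print(configured_name == catalog_name)     # False -> partition sync sees two different tables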

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, Oct 22, 2022

Sounds like a good enhancement. Have created a JIRA: https://issues.apache.org/jira/browse/HUDI-5074. One of us will take it up. Thanks!

1 reaction
dragonH commented, Oct 3, 2022

@kazdy thanks for the information!

I wonder if there's a better way to avoid this kind of issue, e.g.:

  • add a new config property AWSGlueDataCatalogEnabled and, if it is set to True, convert the table name to lowercase at the start and keep it lowercase throughout

  • or edit the hint for the table-name-related config properties to highlight this (if using the AWS Glue Data Catalog, the table name should be lowercase)

because the original hint and the raised exception don't really point to this.

Thanks for your help!
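
Until such an enhancement lands, here is a minimal sketch of the workaround implied above (an assumption based on this discussion, not an official fix): keep every table-name-related option, and for consistency the S3 path, lowercase so hive sync and the Glue Data Catalog agree on the table identity.

# Reuse the hudi_options dict from the issue description, but with an all-lowercase table name.
table_name = 'customer_sample_hudi'

hudi_options.update({
    'hoodie.table.name': table_name,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'path': f's3://my_staging_bucket/{table_name}/',
})

df_products.write.format("hudi").options(**hudi_options).mode("append").save()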

Read more comments on GitHub >

Top Results From Across the Web

Using the Hudi framework in AWS Glue
This example script demonstrates how to write a Hudi table to Amazon S3 and register the table to the AWS Glue Data Catalog....

Hive Sync fails: AWSGlueDataCatalogHiveClientFactory not found
I'm syncing data written to S3 using Apache Hudi with Hive & Glue. Hudi options:

FAQs | Apache Hudi
As of September 2019, Hudi can support Spark 2.1+, Hive 2.x, Hadoop 2.7+ (not Hadoop 3). How does Hudi actually store data inside...

Hudi 0.11 + AWS Glue doesn't work yet. | by Life-is-short--so
In short, Metadata sync + Glue Data Catalog fails with this exception. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.

Use AWS Glue Data Catalog as a metastore
Higher latency with Glue Catalog than Databricks Hive metastore. No instance profile attached to the Databricks Runtime cluster. Insufficient Glue Catalog ...
