
[SUPPORT] Hudi failed to sync new partition table to glue data catalog

See original GitHub issue

Describe the problem you faced

I am trying to bulk_insert a small table (~150 MB) into S3 using Apache Hudi. I want to partition the data on the created field in the format yyyy/MM/dd using hive_style_partitioning. The table (with its partition subfolders) is created successfully on S3; however, Hudi fails with the following:

Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table brand_tree
...
Caused by: java.lang.IllegalArgumentException: Partition key parts [created] does not match with partition values [2019, 01, 17]. Check partition strategy. 

Here is my bulk_insert configuration:

{
    'hoodie.bulkinsert.shuffle.parallelism': 3,
    'hoodie.datasource.write.operation': 'bulk_insert',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.deltastreamer.keygen.timebased.timestamp.type': 'DATE_STRING',
    'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy/MM/dd',
    'hoodie.deltastreamer.keygen.timebased.input.dateformat': 'yyyy-MM-dd HH:mm:ss',
    'hoodie.datasource.write.partitionpath.field': f'{partition_key}:TIMESTAMP',
    'hoodie.datasource.hive_sync.partition_fields': f'{partition_key}',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex': '',
    'hoodie.deltastreamer.keygen.timebased.input.timezone': 'GMT',
    'className': 'org.apache.hudi',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.precombine.field': pre_combine_key,
    'hoodie.datasource.write.recordkey.field': ','.join(record_keys),
    'hoodie.table.name': table_name,
    'hoodie.consistency.check.enabled': 'true',
    'hoodie.datasource.hive_sync.database': db_name,
    'hoodie.datasource.write.table.name': table_name,
    'hoodie.datasource.hive_sync.table': table_name,
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.index.type': 'GLOBAL_SIMPLE'
}
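
For context, the "Partition key parts [created] does not match with partition values [2019, 01, 17]" error suggests that the yyyy/MM/dd output format produces a three-level partition path, which MultiPartKeysValueExtractor splits into three values while hive sync only declares a single partition field (created). Below is a minimal sketch of two ways the configuration could be reconciled, assuming the goal is a single created partition column; neither variant has been verified against this exact job:

# Sketch only: two possible ways to make the partition path and the
# hive-sync partition fields agree (unverified against this job).

# Option A: keep the yyyy/MM/dd folder layout and use the extractor that
# maps year/month/day folders back to a single partition value.
option_a = {
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'false',  # slash-encoded paths are not hive-style
}

# Option B: keep hive-style partitioning but emit a single-level partition
# value so one path segment maps to the one declared partition field.
option_b = {
    'hoodie.deltastreamer.keygen.timebased.output.dateformat': 'yyyy-MM-dd',
    'hoodie.datasource.hive_sync.partition_extractor_class':
        'org.apache.hudi.hive.MultiPartKeysValueExtractor',
}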

And here is the Python traceback:

2021-11-16 11:26:57,813 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(70)): Error from Python:Traceback (most recent call last):
  File "/tmp/dev-poc", line 357, in <module>
    main()
  File "/tmp/dev-poc", line 353, in main
    rds_job_driver.run()
  File "/tmp/dev-poc", line 228, in run
    self.transform()
  File "/tmp/dev-poc", line 330, in transform
    partition_key=partition_key)
  File "/tmp/dev-poc", line 176, in overwrite
    self.write_df(data_frame, write_mode, target_path, self.data_format, **combined_conf)
  File "/tmp/dev-poc", line 91, in write_df
    save(target)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 734, in save
    self._jwrite.save(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o164.save.
: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing brand_tree
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:132)
	at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:425)
	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:479)
	at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:475)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
	at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:475)
	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:548)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:238)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:170)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table brand_tree
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:332)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:188)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:118)
	... 40 more
Caused by: java.lang.IllegalArgumentException: Partition key parts [created] does not match with partition values [2019, 01, 17]. Check partition strategy. 
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
	at org.apache.hudi.hive.HoodieHiveClient.getPartitionClause(HoodieHiveClient.java:223)
	at org.apache.hudi.hive.HoodieHiveClient.constructAddPartitions(HoodieHiveClient.java:199)
	at org.apache.hudi.hive.HoodieHiveClient.addPartitionsToTable(HoodieHiveClient.java:143)
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:327)
	... 42 more

1. Sample partition path: s3://schema_name/table_name/created=2021/11/10/ -> this folder contains the Parquet output files.
2. Partitioning uses hive style.
3. I am converting the partition field (created) to a string before ingesting/saving to the Hudi table on S3 (see the sketch after this list).
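
Point 3 above refers to casting the created column to a string before the write. A minimal PySpark sketch of that step, assuming a DataFrame df with a timestamp column created and the configuration dict shown earlier (here called hudi_options); the target path is the sample path from point 1:

from pyspark.sql import functions as F

# Cast the timestamp column to the string format the key generator expects
# as input ('yyyy-MM-dd HH:mm:ss', matching
# hoodie.deltastreamer.keygen.timebased.input.dateformat above).
df = df.withColumn('created', F.date_format(F.col('created'), 'yyyy-MM-dd HH:mm:ss'))

# hudi_options stands for the configuration dict shown earlier.
(df.write
    .format('hudi')
    .options(**hudi_options)
    .mode('overwrite')
    .save('s3://schema_name/table_name/'))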

Environment Description

  • Hudi version : 0.9.0

  • Spark version : 2.4/ PySpark

  • Hadoop version : 2.7.3

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
mtami commented, Nov 20, 2021

For others who might encounter a similar issue: you need to specify the --enable-glue-datacatalog option in the Glue job parameters in order to use the AWS Glue Data Catalog as the Apache Spark Hive metastore. This was not clear from the AWS Glue job documentation at first glance.
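
For reference, here is a hedged sketch of how that parameter could be set when defining the Glue job with boto3; the job name, IAM role, and script location are placeholders, and only the --enable-glue-datacatalog argument itself comes from the comment above. The same argument can also be added to an existing job's parameters in the Glue console.

import boto3

glue = boto3.client('glue')

# Sketch: define the Glue job with the Data Catalog enabled as the
# Spark Hive metastore. Name, Role and ScriptLocation are placeholders.
glue.create_job(
    Name='dev-poc',  # hypothetical job name
    Role='arn:aws:iam::123456789012:role/GlueJobRole',  # hypothetical role
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-bucket/scripts/dev-poc.py',  # hypothetical path
    },
    DefaultArguments={
        '--enable-glue-datacatalog': 'true',  # the option referenced in the comment above
    },
)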

0 reactions
xushiyan commented, Nov 20, 2021

@mtami The logs show that the Hive sync was successful. I suggest you double-check the Glue table in AWS, in the desired region, for the database you set. It could be a permissions issue, too. Please engage with AWS Support to investigate. From Hudi's side, I don't see an issue, so I will close this. Feel free to follow up here if you have further info to share. Thanks.
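
If you want to double-check what actually landed in the catalog, a small boto3 sketch along the lines of that suggestion (the database and table names mirror the issue; the region is a placeholder):

import boto3

glue = boto3.client('glue', region_name='us-east-1')  # placeholder region

# Confirm the table exists in the Glue Data Catalog and inspect its partitions.
table = glue.get_table(DatabaseName='db_name', Name='brand_tree')
print(table['Table']['StorageDescriptor']['Location'])

partitions = glue.get_partitions(DatabaseName='db_name', TableName='brand_tree')
for partition in partitions['Partitions']:
    print(partition['Values'], partition['StorageDescriptor']['Location'])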
