[SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path
Hi All,
I’m currently using AWS Glue Catalog as my Hive Metastore and Glue ETL 2.0 (soon to be 3.0) with Hudi (AWS Hudi Connector 0.9.0 for Glue 1.0 and 2.0).
In Iceberg, you are able to do the following to query the Glue catalog:

```python
df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "path": "my_catalog.my_glue_database.my_iceberg_table",
        "connectionName": "Iceberg Connector for Glue 3.0",
    },
    transformation_ctx="IcebergDyF",
).toDF()
```
I'd like to do something similar with Hudi:

```python
df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "className": "org.apache.hudi",
        "hoodie.table.name": "my_hudi_table",
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": "my_glue_database",
        "hoodie.datasource.hive_sync.table": "my_hudi_table",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": "YYYY,MM,DD",
    },
    transformation_ctx="HudiDyF",
)
```
That way we wouldn't need to fetch the S3 path of our data from boto3 every time, like so:

```python
client = boto3.client('glue')
response = client.get_table(  # <<----- don't want this
    DatabaseName='my_glue_database',
    Name='my_hudi_table'
)
targetPath = response['Table']['StorageDescriptor']['Location']  # <<----- or this

df = glueContext.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "className": "org.apache.hudi",
        "path": targetPath,  # <<----- or this
        "hoodie.table.name": "my_hudi_table",
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.database": "my_glue_database",
        "hoodie.datasource.hive_sync.table": "my_hudi_table",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        "hoodie.datasource.hive_sync.partition_fields": "YYYY,MM,DD",
    },
    transformation_ctx="HudiDyF",
)

# OR
sourceTableDF = spark.read.format('hudi').load(targetPath)
```
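As a stopgap, the boto3 lookup can at least be isolated in a small helper so only one place knows about the catalog round-trip. A minimal sketch (the helper name and the injectable `client` parameter are my own additions for testability, not part of the Glue API):

```python
def get_table_location(database: str, table: str, client=None) -> str:
    """Return a table's S3 location from the AWS Glue catalog.

    `client` is injectable for testing; by default a real Glue client
    is created (boto3 is assumed available, as it is in Glue jobs).
    """
    if client is None:
        import boto3  # provided by the Glue job environment
        client = boto3.client("glue")
    response = client.get_table(DatabaseName=database, Name=table)
    return response["Table"]["StorageDescriptor"]["Location"]
```

Usage would then be `targetPath = get_table_location('my_glue_database', 'my_hudi_table')` followed by `spark.read.format('hudi').load(targetPath)`.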
Is there any way to do this? I'm very new to Hudi, so if my configuration settings are wrong and this is actually possible, please let me know!
EDIT: It's worth mentioning that the above, including `path`, throws the error below; though that may be down to my configuration:
```
Py4JJavaError: An error occurred while calling o90.getSource.
: org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table
    at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:288)
    at org.apache.hudi.HoodieFileIndex.getAllQueryPartitionPaths(HoodieFileIndex.scala:345)
    at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:420)
    at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:214)
    at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:149)
    at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:116)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:67)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
    at com.amazonaws.services.glue.marketplace.connector.CustomDataSourceFactory$.loadSparkDataSource(CustomDataSourceFactory.scala:89)
    at com.amazonaws.services.glue.marketplace.connector.CustomDataSourceFactory$.loadDataSource(CustomDataSourceFactory.scala:33)
    at com.amazonaws.services.glue.GlueContext.getCustomSource(GlueContext.scala:159)
    at com.amazonaws.services.glue.GlueContext.getSourceInternal(GlueContext.scala:910)
    at com.amazonaws.services.glue.GlueContext.getSource(GlueContext.scala:753)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 0, not attempting to retry it. Exception during serialization: java.io.NotSerializableException: org.apache.hadoop.fs.Path
Serialization stack:
    - object not serializable (class: org.apache.hadoop.fs.Path, value: s3://olympus-dev-data-refined/clientcontact_hudi_v6)
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 1)
    - field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
    - object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(s3://olympus-dev-data-refined/clientcontact_hudi_v6))
    - writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
    - object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@691)
    - field (class: org.apache.spark.scheduler.ResultTask, name: partition, type: interface org.apache.spark.Partition)
    - object (class org.apache.spark.scheduler.ResultTask, ResultTask(0, 0))
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
    at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
    at org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:73)
    at org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllPartitionPaths(FileSystemBackedTableMetadata.java:81)
    at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:286)
    ... 27 more
```
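For what it's worth, the root cause in the trace is `java.io.NotSerializableException: org.apache.hadoop.fs.Path`. Hudi's Spark setup documentation requires Kryo serialization, which can handle `org.apache.hadoop.fs.Path`, so one thing to check (an assumption on my part, not a confirmed fix for this thread) is whether the job sets:

```
--conf spark.serializer=org.apache.spark.kryo.KryoSerializer
```

In a Glue job this would go into the job parameters, with `--conf` as the key and `spark.serializer=org.apache.spark.kryo.KryoSerializer` as the value.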
Issue Analytics
- Created: a year ago
- Comments: 6 (5 by maintainers)
Top GitHub Comments
@stevenayers you should be able to use the Glue catalog to load a Hudi table like any other Hive external table.
See if you can emulate the below for your needs.
```python
# Read dataframe from source
input_dyf = glueContext.create_dynamic_frame.from_catalog(
    database=src_database,
    table_name=src_table_name,
    push_down_predicate=f"(sdwh_update_year = '{start_date[:4]}' and sdwh_update_month = '{start_date[5:7]}' and sdwh_update_day = '{start_date[8:10]}')",
    transformation_ctx="datasource0",
    additional_options={
        "useS3ListImplementation": True,
        "groupFiles": "inPartition",
        "boundedSize": "6516192768",
    },
)
```
@rkkalluri Thanks for the help! Closing this issue. @stevenayers feel free to reopen this or file a new issue if you face more problems.