
[SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path


Hi All,

I’m currently using AWS Glue Catalog as my Hive Metastore and Glue ETL 2.0 (soon to be 3.0) with Hudi (AWS Hudi Connector 0.9.0 for Glue 1.0 and 2.0).

In Iceberg, you are able to do the following to query the Glue catalog:

df = glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={
            "path": "my_catalog.my_glue_database.my_iceberg_table",
            "connectionName": "Iceberg Connector for Glue 3.0",
        },
        transformation_ctx="IcebergDyF",
    ).toDF()

I’d like to do something similar with Hudi:

df = glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options= {
            "className": "org.apache.hudi",
            "hoodie.table.name": "my_hudi_table",
            "hoodie.consistency.check.enabled": "true",
            "hoodie.datasource.hive_sync.use_jdbc": "false",
            "hoodie.datasource.hive_sync.database": "my_glue_database",
            "hoodie.datasource.hive_sync.table":  "my_hudi_table",
            "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
            "hoodie.datasource.hive_sync.partition_fields": "YYYY,MM,DD"
        },
        transformation_ctx="HudiDyF",
    )

Meaning we wouldn’t need to grab the S3 path of our data via boto3 every time, like we currently do:

client = boto3.client('glue')
response = client.get_table( # <<----- don't want this
    DatabaseName='my_glue_database',
    Name='my_hudi_table'
) 
targetPath = response['Table']['StorageDescriptor']['Location'] # <<----- or this
df = glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options= {
            "className": "org.apache.hudi",
            "path": targetPath, # <<----- or this
            "hoodie.table.name": "my_hudi_table",
            "hoodie.consistency.check.enabled": "true",
            "hoodie.datasource.hive_sync.use_jdbc": "false",
            "hoodie.datasource.hive_sync.database": "my_glue_database",
            "hoodie.datasource.hive_sync.table": "my_hudi_table",
            "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
            "hoodie.datasource.hive_sync.partition_fields": "YYYY,MM,DD"
        },
        transformation_ctx="HudiDyF",
    )
# OR
sourceTableDF = spark.read.format('hudi').load(targetPath)
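
While the path lookup is still required, one way to at least contain the boilerplate is to wrap the boto3 call in a small helper so only one place knows about the StorageDescriptor. This is just a sketch of the same workaround shown above; get_table_location is an illustrative name, not part of the Glue or Hudi APIs.

import boto3

def get_table_location(database, table):
    # Resolve a catalog table's S3 location via the Glue API
    # (the same get_table call as above, wrapped for reuse).
    glue = boto3.client('glue')
    response = glue.get_table(DatabaseName=database, Name=table)
    return response['Table']['StorageDescriptor']['Location']

targetPath = get_table_location('my_glue_database', 'my_hudi_table')
sourceTableDF = spark.read.format('hudi').load(targetPath)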

Is there any way to do this? Very new to Hudi, so if my configuration settings are wrong and this is possible, please let me know!

EDIT: It’s worth mentioning that the snippet above that does include path throws the error below, though that may just be my configuration.

Py4JJavaError: An error occurred while calling o90.getSource.
: org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table
	at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:288)
	at org.apache.hudi.HoodieFileIndex.getAllQueryPartitionPaths(HoodieFileIndex.scala:345)
	at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:420)
	at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:214)
	at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:149)
	at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:116)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:67)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
	at com.amazonaws.services.glue.marketplace.connector.CustomDataSourceFactory$.loadSparkDataSource(CustomDataSourceFactory.scala:89)
	at com.amazonaws.services.glue.marketplace.connector.CustomDataSourceFactory$.loadDataSource(CustomDataSourceFactory.scala:33)
	at com.amazonaws.services.glue.GlueContext.getCustomSource(GlueContext.scala:159)
	at com.amazonaws.services.glue.GlueContext.getSourceInternal(GlueContext.scala:910)
	at com.amazonaws.services.glue.GlueContext.getSource(GlueContext.scala:753)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 0, not attempting to retry it. Exception during serialization: java.io.NotSerializableException: org.apache.hadoop.fs.Path
Serialization stack:
	- object not serializable (class: org.apache.hadoop.fs.Path, value: s3://olympus-dev-data-refined/clientcontact_hudi_v6)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 1)
	- field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
	- object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(s3://olympus-dev-data-refined/clientcontact_hudi_v6))
	- writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
	- object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@691)
	- field (class: org.apache.spark.scheduler.ResultTask, name: partition, type: interface org.apache.spark.Partition)
	- object (class org.apache.spark.scheduler.ResultTask, ResultTask(0, 0))
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
	at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
	at org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:73)
	at org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllPartitionPaths(FileSystemBackedTableMetadata.java:81)
	at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:286)
	... 27 more

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

2 reactions
rkkalluri commented, May 1, 2022

@stevenayers you should be able to use the Glue catalog to load the Hudi table like any other Hive external table.

See if you can emulate the below for your needs.

Read dataframe from source

input_dyf = glueContext.create_dynamic_frame.from_catalog(
    database=src_database,
    table_name=src_table_name,
    push_down_predicate=f"(sdwh_update_year = '{start_date[:4]}' and sdwh_update_month = '{start_date[5:7]}' and sdwh_update_day = '{start_date[8:10]}')",
    transformation_ctx="datasource0",
    additional_options={"useS3ListImplementation": True, "groupFiles": "inPartition", "boundedSize": "6516192768"},
)
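
Adapted to the table names from the original question, that pattern would look roughly like the sketch below. This is only an illustration of rkkalluri’s suggestion, assuming the Hudi table has already been hive-synced into the Glue catalog as my_glue_database.my_hudi_table; whether Glue resolves it through a Hudi-aware reader or as a plain Hive external table (which can matter for MoR tables) depends on the Glue version and connector setup and is not confirmed in this thread.

# Sketch: read the hive-synced Hudi table straight from the Glue catalog,
# with no boto3 lookup of the S3 location.
hudi_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_database",
    table_name="my_hudi_table",
    transformation_ctx="HudiFromCatalog",
)
hudi_df = hudi_dyf.toDF()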

1 reaction
yihua commented, May 2, 2022

@rkkalluri Thanks for the help! Closing this issue. @stevenayers feel free to reopen this or file a new issue if you face more problems.

Top Results From Across the Web

  • Using the Hudi framework in AWS Glue
    You can use AWS Glue to perform read and write operations on Hudi tables in Amazon S3, or work with Hudi tables using...
  • Hudi Merge on Read (MoR) - EMR Workshop
    This lab demonstrates using PySpark on Apache Hudi on Amazon EMR to ... Sync the Hudi tables to the Hive/Glue Catalog; Upsert some...
  • Using Athena to query Apache Hudi datasets - Amazon Web Services
    In your CREATE TABLE statement, specify the Hudi table path in your LOCATION clause. ... Using MSCK REPAIR TABLE on Hudi tables in...
  • Hive Connector — Presto 0.278 Documentation
    However, Kerberos authentication by ticket cache is not yet supported. ... The Hive Connector can read and write tables that are stored in...
  • EMR Hudi cannot create hive connection jdbc:hive2 ...
    I assume you are following the tutorial from AWS documentation. I got it to work using Hudi 0.9.0 by setting hive_sync.mode to hms...
