
[SUPPORT] Read Hudi Table from Hive/Glue Catalog without specifying the S3 Path


Hi All,

I’m currently using AWS Glue Catalog as my Hive Metastore and Glue ETL 2.0 (soon to be 3.0) with Hudi (AWS Hudi Connector 0.9.0 for Glue 1.0 and 2.0).

In Iceberg, you are able to do the following to query the Glue catalog:

df = glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={
            "path": "my_catalog.my_glue_database.my_iceberg_table",
            "connectionName": "Iceberg Connector for Glue 3.0",
        },
        transformation_ctx="IcebergDyF",
    ).toDF()

I’d like to do something similar with Hudi:

df = glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options= {
            "className": "org.apache.hudi",
            "hoodie.table.name": "my_hudi_table",
            "hoodie.consistency.check.enabled": "true",
            "hoodie.datasource.hive_sync.use_jdbc": "false",
            "hoodie.datasource.hive_sync.database": "my_glue_database",
            "hoodie.datasource.hive_sync.table":  "my_hudi_table",
            "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
            "hoodie.datasource.hive_sync.partition_fields": "YYYY,MM,DD"
        },
        transformation_ctx="HudiDyF",
    )

Meaning we wouldn’t need to grab the S3 path of our data via boto3 every time, like we currently do:

client = boto3.client('glue')
response = client.get_table( # <<----- don't want this
    DatabaseName='my_glue_database',
    Name='my_hudi_table'
) 
targetPath = response['Table']['StorageDescriptor']['Location'] # <<----- or this
df = glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options= {
            "className": "org.apache.hudi",
            "path": targetPath, # <<----- or this
            "hoodie.table.name": "my_hudi_table",
            "hoodie.consistency.check.enabled": "true",
            "hoodie.datasource.hive_sync.use_jdbc": "false",
            "hoodie.datasource.hive_sync.database": "my_glue_database",
            "hoodie.datasource.hive_sync.table": "my_hudi_table",
            "hoodie.datasource.hive_sync.enable": "true",
            "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
            "hoodie.datasource.hive_sync.partition_fields": "YYYY,MM,DD"
        },
        transformation_ctx="HudiDyF",
    )
# OR
sourceTableDF = spark.read.format('hudi').load(targetPath)
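
While the path lookup is still required, one way to at least contain the boilerplate is to wrap the boto3 call in a small helper so only one place knows about the StorageDescriptor. This is just a sketch of the same workaround shown above; get_table_location is an illustrative name, not part of the Glue or Hudi APIs.

import boto3

def get_table_location(database, table):
    # Resolve a catalog table's S3 location via the Glue API
    # (the same get_table call as above, wrapped for reuse).
    glue = boto3.client('glue')
    response = glue.get_table(DatabaseName=database, Name=table)
    return response['Table']['StorageDescriptor']['Location']

targetPath = get_table_location('my_glue_database', 'my_hudi_table')
sourceTableDF = spark.read.format('hudi').load(targetPath)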

Is there any way to do this? Very new to Hudi, so if my configuration settings are wrong and this is possible, please let me know!

EDIT: It’s worth mentioning that the snippet above that does include path throws the error below, though that may just be my configuration.

Py4JJavaError: An error occurred while calling o90.getSource.
: org.apache.hudi.exception.HoodieException: Error fetching partition paths from metadata table
	at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:288)
	at org.apache.hudi.HoodieFileIndex.getAllQueryPartitionPaths(HoodieFileIndex.scala:345)
	at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:420)
	at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:214)
	at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:149)
	at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:116)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:67)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
	at com.amazonaws.services.glue.marketplace.connector.CustomDataSourceFactory$.loadSparkDataSource(CustomDataSourceFactory.scala:89)
	at com.amazonaws.services.glue.marketplace.connector.CustomDataSourceFactory$.loadDataSource(CustomDataSourceFactory.scala:33)
	at com.amazonaws.services.glue.GlueContext.getCustomSource(GlueContext.scala:159)
	at com.amazonaws.services.glue.GlueContext.getSourceInternal(GlueContext.scala:910)
	at com.amazonaws.services.glue.GlueContext.getSource(GlueContext.scala:753)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Failed to serialize task 0, not attempting to retry it. Exception during serialization: java.io.NotSerializableException: org.apache.hadoop.fs.Path
Serialization stack:
	- object not serializable (class: org.apache.hadoop.fs.Path, value: s3://olympus-dev-data-refined/clientcontact_hudi_v6)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 1)
	- field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
	- object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(s3://olympus-dev-data-refined/clientcontact_hudi_v6))
	- writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
	- object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@691)
	- field (class: org.apache.spark.scheduler.ResultTask, name: partition, type: interface org.apache.spark.Partition)
	- object (class org.apache.spark.scheduler.ResultTask, ResultTask(0, 0))
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
	at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:361)
	at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
	at org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:73)
	at org.apache.hudi.metadata.FileSystemBackedTableMetadata.getAllPartitionPaths(FileSystemBackedTableMetadata.java:81)
	at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:286)
	... 27 more

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

2 reactions
rkkalluri commented, May 1, 2022

@stevenayers you should be able to use the Glue catalog to load the Hudi table like any other Hive external table.

See if you can emulate the below for your needs.

Read dataframe from source

input_dyf = glueContext.create_dynamic_frame.from_catalog(
    database=src_database,
    table_name=src_table_name,
    push_down_predicate=f"(sdwh_update_year = '{start_date[:4]}' and sdwh_update_month = '{start_date[5:7]}' and sdwh_update_day = '{start_date[8:10]}')",
    transformation_ctx="datasource0",
    additional_options={"useS3ListImplementation": True, "groupFiles": "inPartition", "boundedSize": "6516192768"},
)
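
Adapted to the table names from the original question, that pattern would look roughly like the sketch below. This is only an illustration of rkkalluri’s suggestion, assuming the Hudi table has already been hive-synced into the Glue catalog as my_glue_database.my_hudi_table; whether Glue resolves it through a Hudi-aware reader or as a plain Hive external table (which can matter for MoR tables) depends on the Glue version and connector setup and is not confirmed in this thread.

# Sketch: read the hive-synced Hudi table straight from the Glue catalog,
# with no boto3 lookup of the S3 location.
hudi_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_database",
    table_name="my_hudi_table",
    transformation_ctx="HudiFromCatalog",
)
hudi_df = hudi_dyf.toDF()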

1 reaction
yihua commented, May 2, 2022

@rkkalluri Thanks for the help! Closing this issue. @stevenayers feel free to reopen this or file a new issue if you face more problems.

Top Results From Across the Web

  • Using the Hudi framework in AWS Glue
    You can use AWS Glue to perform read and write operations on Hudi tables in Amazon S3, or work with Hudi tables using...
  • Hudi Merge on Read (MoR) - EMR Workshop
    This lab demonstrates using PySpark on Apache Hudi on Amazon EMR to ... Sync the Hudi tables to the Hive/Glue Catalog; Upsert some...
  • Using Athena to query Apache Hudi datasets - Amazon Web Services
    In your CREATE TABLE statement, specify the Hudi table path in your LOCATION clause. ... Using MSCK REPAIR TABLE on Hudi tables in...
  • Hive Connector — Presto 0.278 Documentation
    However, Kerberos authentication by ticket cache is not yet supported. ... The Hive Connector can read and write tables that are stored in...
  • EMR Hudi cannot create hive connection jdbc:hive2 ...
    I assume you are following the tutorial from AWS documentation. I got it to work using Hudi 0.9.0 by setting hive_sync.mode to hms...
