
[SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql


Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

A clear and concise description of the problem.

To Reproduce

Steps to reproduce the behavior:

  1. Enter a PySpark interactive session with the command:
pyspark --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  2. Run a Spark SQL query on any table (Hudi or not), e.g.:
spark.sql('select * from somedb.non_hudi_table')
spark.sql('select * from somedb.hudi_table')
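
For reference, the CLI invocation in step 1 can also be expressed programmatically (a minimal sketch; the config keys mirror the CLI flags above, and spark.jars.packages only takes effect if no SparkSession/JVM is already running when getOrCreate() is called):

from pyspark.sql import SparkSession

# Equivalent to the pyspark invocation above.
spark = (SparkSession.builder
         .config('spark.jars.packages',
                 'org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,'
                 'org.apache.spark:spark-avro_2.12:3.1.2')
         .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
         .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension')
         .getOrCreate())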

Expected behavior

When I run a select query on a non-Hudi table in Spark with the Hudi dependencies, I should get the correct dataframe containing the data I selected.

When querying a Hudi table, it should return a dataframe with the real data I selected and/or the Hudi-specific columns.
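
To make the expectation concrete, here is a sketch of the check (somedb.hudi_table is a placeholder name from the steps above); a healthy query against a Hudi table returns the user columns plus Hudi's metadata columns:

# Placeholder table name; expect the five Hudi metadata columns
# (_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key,
#  _hoodie_partition_path, _hoodie_file_name) ahead of the user columns.
df = spark.sql('select * from somedb.hudi_table')
df.printSchema()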

Environment Description

  • Hudi version : 0.10.0 (replacing the 0.8.0 bundled in EMR)

  • Spark version : 3.1.2

  • Hive version : 3.1.2

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional context

  • About Hudi

This occurs on AWS EMR 6.4.0, in which the bundled Hudi 0.8 has been replaced with Hudi 0.10.

The replacement procedure is as follows:

  1. Download the packages:
hudi-spark3-bundle_2.12-0.10.0.jar 
hudi-hadoop-mr-bundle-0.10.0.jar  
hudi-utilities-bundle_2.12-0.10.0.jar  
hudi-hive-sync-bundle-0.10.0.jar 
hudi-presto-bundle-0.10.0.jar  
hudi-timeline-server-bundle-0.10.0.jar  
hudi-cli-0.10.0.jar  
hudi-client-common-0.10.0.jar  
hudi-common-0.10.0.jar  
hudi-hadoop-mr-0.10.0.jar  
hudi-hive-sync-0.10.0.jar  
hudi-spark3_2.12-0.10.0.jar 
hudi-spark-client-0.10.0.jar 
hudi-spark-common_2.12-0.10.0.jar 
hudi-sync-common-0.10.0.jar 
hudi-timeline-service-0.10.0.jar  
hudi-utilities_2.12-0.10.0.jar  
  2. Replace the Hudi 0.8 jars (the same packages as listed above, but in version 0.8) in /usr/lib/hudi/ with the downloaded packages (a classpath check is sketched after this section).
  3. Now Spark SQL can be tried with Hudi:
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  • About Catalog: In this case, I am using AWS Glue as the catalog.

  • About Spark: This occurs ONLY in PySpark. In a Spark Scala interactive session, Spark SQL queries with Hudi, such as select, update, and delete, work just fine, as presented in the documentation.

  • About Hudi 0.8: If I use Hudi 0.8 with the same Hadoop, Spark, and Hive versions mentioned above (also on EMR), Spark SQL executes correctly for normal tables in a pyspark session.
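
As a way to verify the jar replacement above actually took effect, the following diagnostic sketch dumps the JVM classpath entries that mention hudi or aws via the py4j gateway (note: jars added with --packages are loaded through a separate classloader and may not appear in java.class.path):

# Print every classpath entry containing 'hudi' or 'aws' to spot
# leftover 0.8 jars or conflicting AWS SDK versions.
jvm = spark.sparkContext._jvm
cp = jvm.java.lang.System.getProperty('java.class.path')
for entry in cp.split(':'):
    if 'hudi' in entry or 'aws' in entry:
        print(entry)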

Stacktrace

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o67.sql.
: java.lang.NoSuchMethodError: com.amazonaws.transform.JsonUnmarshallerContext.getCurrentToken()Lcom/amazonaws/thirdparty/jackson/core/JsonToken;
	at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:39)
	at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:29)
	at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:118)
	at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
	at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1734)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1454)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1369)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
	at com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:10640)
	at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10607)
	at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10596)
	at com.amazonaws.services.glue.AWSGlueClient.executeGetDatabase(AWSGlueClient.java:4466)
	at com.amazonaws.services.glue.AWSGlueClient.getDatabase(AWSGlueClient.java:4435)
	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.doesDefaultDBExist(AWSCatalogMetastoreClient.java:238)
	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:151)
	at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:20)
	at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:507)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3746)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3726)
	at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3988)
	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:251)
	at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:234)
	at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:402)
	at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:335)
	at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:315)
	at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:291)
	at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:257)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:283)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
	at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:384)
	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:249)
	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:105)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:135)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:125)
	at org.apache.spark.sql.internal.SharedState.isDatabaseExistent$1(SharedState.scala:169)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:201)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:153)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:52)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:99)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:99)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupGlobalTempView(SessionCatalog.scala:870)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveTempViews$$lookupTempView(Analyzer.scala:916)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveTempViews$$lookupAndResolveTempView(Analyzer.scala:930)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:875)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:873)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:75)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$2(AnalysisHelper.scala:87)
	at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:388)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:424)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:256)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:422)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:370)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:87)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$.apply(Analyzer.scala:873)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:1112)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:1077)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:220)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:217)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$6(RuleExecutor.scala:290)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$RuleExecutionContext$.withContext(RuleExecutor.scala:333)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5(RuleExecutor.scala:290)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5$adapted(RuleExecutor.scala:280)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:280)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:192)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:163)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
a0x commented, Jan 5, 2022

Finally, I fixed this problem by removing the AWS deps in packaging/hudi-spark-bundle/pom.xml and recompiling it myself.

<!-- line 106, keep it as comment -->
<!-- <include>com.amazonaws:dynamodb-lock-client</include> -->
<!-- <include>com.amazonaws:aws-java-sdk-dynamodb</include> -->
<!-- <include>com.amazonaws:aws-java-sdk-core</include> -->
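
After commenting out those includes, the bundle has to be rebuilt. A plausible invocation for Hudi 0.10.0 is shown below (the profile flags are an assumption; check the build instructions in the Hudi README for the release); the rebuilt jar under packaging/hudi-spark-bundle/target/ then replaces the stock hudi-spark3-bundle:

mvn clean package -DskipTests -Dspark3 -Dscala-2.12
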
1 reaction
a0x commented, Jan 4, 2022

@kazdy I did recompile the Hudi packages with the mentioned config, yet the error remains.

This is an interesting problem, because everything works fine in spark-shell, yet the problem occurs only in PySpark.

So I think the library conflict is hidden in the difference between spark-shell and pyspark.
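
One way to pin down such a conflict from inside the failing PySpark session is to ask the JVM where the offending class comes from, using the standard java.lang.Class reflection API through py4j (a diagnostic sketch; running the same check in spark-shell lets you compare the two environments):

# Prints the jar that actually provided the class named in the
# NoSuchMethodError; compare the output against spark-shell.
jvm = spark.sparkContext._jvm
klass = jvm.java.lang.Class.forName('com.amazonaws.transform.JsonUnmarshallerContext')
print(klass.getProtectionDomain().getCodeSource().getLocation())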
