
[SUPPORT] PySpark(3.1.2) with Hudi(0.10.0) failed when querying spark sql


Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

A clear and concise description of the problem.

To Reproduce

Steps to reproduce the behavior:

  1. Enter a PySpark interactive session with the command:
pyspark --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  2. Run a Spark SQL query on any table (Hudi or not), e.g.:
spark.sql('select * from somedb.non_hudi_table')
spark.sql('select * from somedb.hudi_table')
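
For reference, the CLI invocation in step 1 can also be expressed programmatically (a minimal sketch; the config keys mirror the CLI flags above, and spark.jars.packages only takes effect if no SparkSession/JVM is already running when getOrCreate() is called):

from pyspark.sql import SparkSession

# Equivalent to the pyspark invocation above.
spark = (SparkSession.builder
         .config('spark.jars.packages',
                 'org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,'
                 'org.apache.spark:spark-avro_2.12:3.1.2')
         .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
         .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension')
         .getOrCreate())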

Expected behavior

When I run a select query on a non-Hudi table in Spark with the Hudi dependencies, I should get the correct dataframe containing the data I selected.

When querying a Hudi table, it should return a dataframe with the real data I selected and/or the Hudi-specific columns.
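
To make the expectation concrete, here is a sketch of the check (somedb.hudi_table is a placeholder name from the steps above); a healthy query against a Hudi table returns the user columns plus Hudi's metadata columns:

# Placeholder table name; expect the five Hudi metadata columns
# (_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key,
#  _hoodie_partition_path, _hoodie_file_name) ahead of the user columns.
df = spark.sql('select * from somedb.hudi_table')
df.printSchema()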

Environment Description

  • Hudi version : 0.10.0 (replacing the 0.8.0 bundled in EMR)

  • Spark version : 3.1.2

  • Hive version : 3.1.2

  • Hadoop version : 3.2.1

  • Storage (HDFS/S3/GCS…) : S3

  • Running on Docker? (yes/no) : no

Additional context

  • About Hudi

This occurs on AWS EMR 6.4.0, in which the bundled Hudi 0.8 has been replaced with Hudi 0.10.

The replacement procedure is as follows:

  1. Download the packages:
hudi-spark3-bundle_2.12-0.10.0.jar 
hudi-hadoop-mr-bundle-0.10.0.jar  
hudi-utilities-bundle_2.12-0.10.0.jar  
hudi-hive-sync-bundle-0.10.0.jar 
hudi-presto-bundle-0.10.0.jar  
hudi-timeline-server-bundle-0.10.0.jar  
hudi-cli-0.10.0.jar  
hudi-client-common-0.10.0.jar  
hudi-common-0.10.0.jar  
hudi-hadoop-mr-0.10.0.jar  
hudi-hive-sync-0.10.0.jar  
hudi-spark3_2.12-0.10.0.jar 
hudi-spark-client-0.10.0.jar 
hudi-spark-common_2.12-0.10.0.jar 
hudi-sync-common-0.10.0.jar 
hudi-timeline-service-0.10.0.jar  
hudi-utilities_2.12-0.10.0.jar  
  2. Replace the Hudi 0.8 jars (the same packages as listed above, but in version 0.8) in /usr/lib/hudi/ with the downloaded packages (a classpath check is sketched after this section).
  3. Now Spark SQL can be tried with Hudi:
spark-sql --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  • About Catalog: In this case, I am using AWS Glue as the catalog.

  • About Spark: This occurs ONLY in PySpark. In a Spark Scala interactive session, Spark SQL queries with Hudi, such as select, update, and delete, work just fine, as presented in the documentation.

  • About Hudi 0.8: If I use Hudi 0.8 with the same Hadoop, Spark, and Hive versions mentioned above (also on EMR), Spark SQL executes correctly for normal tables in a pyspark session.
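
As a way to verify the jar replacement above actually took effect, the following diagnostic sketch dumps the JVM classpath entries that mention hudi or aws via the py4j gateway (note: jars added with --packages are loaded through a separate classloader and may not appear in java.class.path):

# Print every classpath entry containing 'hudi' or 'aws' to spot
# leftover 0.8 jars or conflicting AWS SDK versions.
jvm = spark.sparkContext._jvm
cp = jvm.java.lang.System.getProperty('java.class.path')
for entry in cp.split(':'):
    if 'hudi' in entry or 'aws' in entry:
        print(entry)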

Stacktrace

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o67.sql.
: java.lang.NoSuchMethodError: com.amazonaws.transform.JsonUnmarshallerContext.getCurrentToken()Lcom/amazonaws/thirdparty/jackson/core/JsonToken;
	at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:39)
	at com.amazonaws.services.glue.model.transform.GetDatabaseResultJsonUnmarshaller.unmarshall(GetDatabaseResultJsonUnmarshaller.java:29)
	at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:118)
	at com.amazonaws.http.JsonResponseHandler.handle(JsonResponseHandler.java:43)
	at com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:69)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1734)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleSuccessResponse(AmazonHttpClient.java:1454)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1369)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
	at com.amazonaws.services.glue.AWSGlueClient.doInvoke(AWSGlueClient.java:10640)
	at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10607)
	at com.amazonaws.services.glue.AWSGlueClient.invoke(AWSGlueClient.java:10596)
	at com.amazonaws.services.glue.AWSGlueClient.executeGetDatabase(AWSGlueClient.java:4466)
	at com.amazonaws.services.glue.AWSGlueClient.getDatabase(AWSGlueClient.java:4435)
	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.doesDefaultDBExist(AWSCatalogMetastoreClient.java:238)
	at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:151)
	at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:20)
	at org.apache.hadoop.hive.ql.metadata.HiveUtils.createMetaStoreClient(HiveUtils.java:507)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3746)
	at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3726)
	at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3988)
	at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:251)
	at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:234)
	at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:402)
	at org.apache.hadoop.hive.ql.metadata.Hive.create(Hive.java:335)
	at org.apache.hadoop.hive.ql.metadata.Hive.getInternal(Hive.java:315)
	at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:291)
	at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:257)
	at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:283)
	at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
	at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
	at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
	at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:384)
	at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:249)
	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:105)
	at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
	at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:135)
	at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:125)
	at org.apache.spark.sql.internal.SharedState.isDatabaseExistent$1(SharedState.scala:169)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:201)
	at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:153)
	at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:52)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:99)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:99)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupGlobalTempView(SessionCatalog.scala:870)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveTempViews$$lookupTempView(Analyzer.scala:916)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveTempViews$$lookupAndResolveTempView(Analyzer.scala:930)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:875)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:873)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:75)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$2(AnalysisHelper.scala:87)
	at org.apache.spark.sql.catalyst.trees.TreeNode.applyFunctionIfChanged$1(TreeNode.scala:388)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:424)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:256)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:422)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:370)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$1(AnalysisHelper.scala:87)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp(AnalysisHelper.scala:86)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsUp$(AnalysisHelper.scala:84)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$.apply(Analyzer.scala:873)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:1112)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:1077)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:220)
	at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
	at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
	at scala.collection.immutable.List.foldLeft(List.scala:89)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:217)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$6(RuleExecutor.scala:290)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$RuleExecutionContext$.withContext(RuleExecutor.scala:333)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5(RuleExecutor.scala:290)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$5$adapted(RuleExecutor.scala:280)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:280)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:192)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:192)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:163)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
	at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
	at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

2 reactions
a0x commented, Jan 5, 2022

Finally, I fixed this problem by removing the AWS deps in packaging/hudi-spark-bundle/pom.xml and recompiling it myself.

<!-- line 106, keep it as comment -->
<!-- <include>com.amazonaws:dynamodb-lock-client</include> -->
<!-- <include>com.amazonaws:aws-java-sdk-dynamodb</include> -->
<!-- <include>com.amazonaws:aws-java-sdk-core</include> -->
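
After commenting out those includes, the bundle has to be rebuilt. A plausible invocation for Hudi 0.10.0 is shown below (the profile flags are an assumption; check the build instructions in the Hudi README for the release); the rebuilt jar under packaging/hudi-spark-bundle/target/ then replaces the stock hudi-spark3-bundle:

mvn clean package -DskipTests -Dspark3 -Dscala-2.12
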
1 reaction
a0x commented, Jan 4, 2022

@kazdy I did recompile the Hudi packages with the mentioned config, yet the error remains.

This is an interesting problem, because everything works fine in spark-shell, yet the problem occurs only in PySpark.

So I think the library conflict is hidden in the difference between spark-shell and pyspark.
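
One way to pin down such a conflict from inside the failing PySpark session is to ask the JVM where the offending class comes from, using the standard java.lang.Class reflection API through py4j (a diagnostic sketch; running the same check in spark-shell lets you compare the two environments):

# Prints the jar that actually provided the class named in the
# NoSuchMethodError; compare the output against spark-shell.
jvm = spark.sparkContext._jvm
klass = jvm.java.lang.Class.forName('com.amazonaws.transform.JsonUnmarshallerContext')
print(klass.getProtectionDomain().getCodeSource().getLocation())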
