Forbid setting "uri" for SparkSessionCatalog when Hive is used.

See original GitHub issue

The workaround for this issue is to set hive.metastore.uris in the Hadoop configuration and not set the “uri” parameter on “spark_catalog” when using an Iceberg Hive-based SparkSessionCatalog.
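As a minimal sketch of that workaround (the catalog class and application name below are illustrative assumptions; the property names come from this issue), the metastore address is supplied only through the Hadoop-level property and the catalog-level “uri” is left unset:

import org.apache.spark.sql.SparkSession

// Sketch: use Iceberg's SparkSessionCatalog and point it at the Hive metastore
// only via the Hadoop-level property. Note that
// spark.sql.catalog.spark_catalog.uri is deliberately NOT set.
val spark = SparkSession.builder()
  .appName("iceberg-session-catalog-example") // illustrative name
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .config("spark.hadoop.hive.metastore.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()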

Our current configuration of the Spark 3 session catalog allows you to set the metastore either by inheriting it from the Hadoop configuration

spark.hadoop.hive.metastore.uris=thrift://localhost:9083

or by specifying it for the catalog itself

spark.sql.catalog.spark_catalog.uri=thrift://localhost:9083 

Alternatively, a user can use a non-Hive-based catalog for the session or Iceberg tables. The key issue here is that if these catalogs differ we can end up with a lot of weird situations.

For example, say we configure only “spark_catalog.uri”. This sets the Iceberg metastore but leaves the Spark session catalog on its default value (in my local case, Derby). This means that almost all database calls go to Derby and are invisible to Iceberg, so I can end up with weird behavior like:

scala> spark.sql("CREATE DATABASE catset")
21/04/16 10:18:56 WARN ObjectStore: Failed to get database catset, returning NoSuchObjectException
res4: org.apache.spark.sql.DataFrame = []
scala> spark.sql("CREATE TABLE catset.foo (x int) USING iceberg")
java.lang.RuntimeException: Metastore operation failed for catset.foo

I have no problem making the database, but my CREATE TABLE command uses the Iceberg catalog, which doesn’t have the database, so I get a “catset” does not exist error.
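For reference, a session setup that reproduces this divergence might look roughly like the following sketch (the catalog class is Iceberg’s SparkSessionCatalog; the rest is illustrative):

import org.apache.spark.sql.SparkSession

// Sketch of the problematic setup: only the Iceberg side gets the Thrift
// metastore via the catalog-level "uri", while the built-in session catalog
// silently falls back to its default metastore (a local Derby instance here).
val spark = SparkSession.builder()
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.iceberg.spark.SparkSessionCatalog")
  .config("spark.sql.catalog.spark_catalog.type", "hive")
  .config("spark.sql.catalog.spark_catalog.uri", "thrift://localhost:9083")
  .getOrCreate()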

I think to address this we need to disallow configuring the SparkSessionCatalog with a different catalog type than the Iceberg catalog. This means we only allow the SparkSessionCatalog to be coupled with a Hive metastore that is also configured for the delegate session catalog.
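A minimal sketch of what such a guard could look like during catalog initialization (the method name and error message below are hypothetical, not the actual Iceberg implementation):

import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical check: reject a catalog-level "uri" for the session catalog,
// since the delegate Spark session catalog only honors the metastore
// configured through the Hadoop/Hive configuration.
def validateSessionCatalogOptions(options: CaseInsensitiveStringMap): Unit = {
  if (options.containsKey("uri")) {
    throw new IllegalArgumentException(
      "Setting 'uri' is not allowed for SparkSessionCatalog; " +
        "set hive.metastore.uris in the Hadoop configuration instead.")
  }
}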

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
RussellSpitzer commented, Sep 13, 2021

As recently discussed with @rdblue and @flyrain I think we have consensus that we don’t want to allow having hive.url set when using the SparkSessionCatalog class since this is always incorrect. Gonna tag this with Good First Issue since it should be a relatively small task.

0 reactions
itachi-sharingan commented, Sep 15, 2021

@RussellSpitzer thanks a lot for the clarification, can you please review the pr.
