Forbid setting "uri" for SparkSessionCatalog when Hive is used.
The workaround for this issue is to set only hive.metastore.uris when using the “spark_catalog”; do not set the “uri” parameter when using an Iceberg Hive-based SparkSessionCatalog.
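A minimal sketch of that workaround as Spark properties, assuming the thrift://localhost:9083 address used elsewhere in this issue and a Hive-backed session catalog. The first two lines are the usual Iceberg session catalog settings and are assumptions here, since the issue does not show them. Note that no spark_catalog.uri line is present:

```properties
# Assumed setup: Iceberg SparkSessionCatalog backed by Hive
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive
# Single source of truth for the metastore location, inherited by both catalogs
spark.hadoop.hive.metastore.uris=thrift://localhost:9083
```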
Our current configuration of the Spark3 Spark session catalog allows you to set the metastore URI either by inheriting it from the Hadoop Configuration
spark.hadoop.hive.metastore.uris=thrift://localhost:9083
or by specifying it for the catalog itself
spark.sql.catalog.spark_catalog.uri=thrift://localhost:9083
A user can also use a non-Hive-based catalog for either the session catalog or the Iceberg tables. The key issue here is that if these catalogs differ, the two can disagree about which databases and tables exist.
For example, say we configure only “spark_catalog.uri”. This sets the Iceberg metastore to that value but leaves the Spark session catalog on its default value (in my local case, Derby). This means that almost all database calls go to Derby and are invisible to Iceberg. So I can end up with weird behavior like
scala> spark.sql("CREATE DATABASE catset")
21/04/16 10:18:56 WARN ObjectStore: Failed to get database catset, returning NoSuchObjectException
res4: org.apache.spark.sql.DataFrame = []
scala> spark.sql("CREATE TABLE catset.foo (x int) USING iceberg")
java.lang.RuntimeException: Metastore operation failed for catset.foo
I have no problem creating the database, but my CREATE TABLE command uses the Iceberg catalog, which doesn’t have the database. So I get the “catset” not exists error.
I think to address this we need to disallow configuring the Spark session catalog with a different catalog type than the Iceberg catalog. This means we only allow the Spark session catalog to be coupled with a Hive metastore that is also configured for the delegate session catalog.
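One way to picture such a check, following the issue title, is a small validation over the Spark configuration that rejects the catalog-level “uri” outright. This is only a sketch: `SessionCatalogCheck` and `validate` are hypothetical names, not part of Iceberg's API, and the real fix may live elsewhere and behave differently.

```scala
// Hypothetical helper for illustration only; not Iceberg's implementation.
object SessionCatalogCheck {
  // Property key taken from the issue text.
  val CatalogUriKey = "spark.sql.catalog.spark_catalog.uri"

  // Returns Left(message) if the catalog-level uri is set, Right(()) otherwise.
  def validate(conf: Map[String, String]): Either[String, Unit] =
    conf.get(CatalogUriKey) match {
      case Some(uri) =>
        Left(
          s"Cannot set $CatalogUriKey ($uri) for the session catalog; " +
            "set spark.hadoop.hive.metastore.uris instead"
        )
      case None => Right(())
    }
}
```

With this sketch, a configuration that only sets spark.hadoop.hive.metastore.uris passes, while one that sets spark.sql.catalog.spark_catalog.uri is rejected with a message pointing at the supported property.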
As recently discussed with @rdblue and @flyrain, I think we have consensus that we don’t want to allow having hive.url set when using the SparkSessionCatalog class, since this is always incorrect. Gonna tag this with Good First Issue since it should be a relatively small task.
@RussellSpitzer thanks a lot for the clarification, can you please review the PR?