Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[SUPPORT] Read hive Table fail when HoodieCatalog used

See original GitHub issue

Hudi 0.11.0, Spark 3.2.1

When hive sync is enabled, spark.read.table("table_name") raises pyspark.sql.utils.AnalysisException: Table does not support reads. The error does not occur when --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog' is not set.

pyspark \
  --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'

from pyspark.sql.functions import expr

sc.setLogLevel("WARN")

# Generate sample records with the Hudi quickstart data generator.
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(
    dataGen.generateInserts(10)
)

# Load the generated JSON records and add a static partition column.
df = spark.read.json(spark.sparkContext.parallelize(inserts, 10)).withColumn(
    "part", expr("'foo'")
)
tableName = "test_hudi_pyspark"
basePath = f"/tmp/{tableName}"

hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": tableName,
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.partition_fields": "part",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
spark.read.format("hudi").load(basePath).count()   # works: path-based read
spark.table("default.test_hudi_pyspark").count()   # raises AnalysisException: catalog-based read

ERROR: pyspark.sql.utils.AnalysisException: Table does not support reads: default.test_hudi_pyspark
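A quick way to check what hive sync actually registered (my own check, not part of the original report) is to dump the metastore entry; the provider and storage details recorded there appear to be what the catalog's Hudi detection keys off:

# Inspect how hive sync registered the table in the metastore. The
# provider/serde rows hint at whether HoodieCatalog.loadTable will
# recognize this as a Hudi table.
spark.sql("DESCRIBE TABLE EXTENDED default.test_hudi_pyspark").show(100, truncate=False)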

I debugged it a bit: when loading the table, HoodieCatalog delegates to super.loadTable, which does not seem to be aware of Hudi, so the case o => o branch below is the one that runs (a workaround sketch follows the snippet).

  override def loadTable(ident: Identifier): Table = {
    try {
      super.loadTable(ident) match {
        case v1: V1Table if sparkAdapter.isHoodieTable(v1.catalogTable) =>
          HoodieInternalV2Table(
            spark,
            v1.catalogTable.location.toString,
            catalogTable = Some(v1.catalogTable),
            tableIdentifier = Some(ident.toString))
        case o => o // this case is used
      }
    } catch {
      case e: Exception =>
        throw e
    }
  }
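Since the path-based read above succeeds, one stopgap, purely a sketch of my own and not from the issue, is to resolve the table's storage location from the metastore and read by path, bypassing HoodieCatalog.loadTable entirely:

# Hypothetical workaround: pull the Location row out of DESCRIBE TABLE
# EXTENDED and feed it to the path-based reader that is known to work.
location = (
    spark.sql("DESCRIBE TABLE EXTENDED default.test_hudi_pyspark")
    .where("col_name = 'Location'")
    .select("data_type")
    .first()[0]
)
spark.read.format("hudi").load(location).count()  # matches the working path-based read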

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
nsivabalan commented, May 12, 2022

hmmm, interesting. @XuQianJin-Stars: Can you assist here, please?

0 reactions
leesf commented, Jun 3, 2022

Closing the issue. @parisni, please reopen if you run into new problems.


Top Results From Across the Web

Error while running query on HIVE - Cloudera Community
Solved: HI All, I am unable to run the simple query on HIVE i.e. describe ... Can you please check database name, which...

spark throws error when reading hive transaction table
The issue you are trying to reading Transactional table (transactional = true) into Spark. Officially Spark not yet supported for Hive-ACID ...

Hive Tables - Spark 3.3.1 Documentation
Specifying storage format for Hive tables; Interacting with Different Versions of Hive Metastore. Spark SQL also supports reading and writing data stored in ...

Why Cannot I Query Newly Inserted Data in a Parquet Hive ...
Why cannot I query newly inserted data in a parquet Hive table using SparkSQL? This problem occurs in the following scenarios: For partitioned tables...

Integrating Apache Hive Metastores with Snowflake
Supported Hive Operations and Table Types. Hive and Snowflake Data Types. Supported File Formats and Options. Unsupported Hive Commands, Features, and Use ...
