Hive: create and write iceberg by hive catalog using Spark, Hive client read no data
See original GitHub issue.
- Environment
spark: 3.0.0
hive: 2.3.7
iceberg: 0.10.0
- SparkSession configuration
val spark = SparkSession
.builder()
.master("local[2]")
.appName("IcebergAPI")
.config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.hive_prod.type", "hive")
.config("spark.sql.catalog.hive_prod.uri", "thrift://localhost:9083")
.enableHiveSupport()
.getOrCreate()
- Create database db by Hive client
➜ bin ./beeline
beeline> !connect jdbc:hive2://localhost:10000 hive hive
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 2.3.7)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> create database db;
No rows affected (0.105 seconds)
- Create iceberg table by hiveCatalog using Spark (Link: https://iceberg.apache.org/hive/#using-hive-catalog)
import java.util.ArrayList
import org.apache.iceberg.{PartitionSpec, Schema}
import org.apache.iceberg.catalog.{Namespace, TableIdentifier}
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.types.Types

def createByHiveCatalog(spark: SparkSession): Unit = {
  val hadoopConfiguration = spark.sparkContext.hadoopConfiguration
  // iceberg.engine.hive.enabled=true
  hadoopConfiguration.set(org.apache.iceberg.hadoop.ConfigProperties.ENGINE_HIVE_ENABLED, "true")
  val hiveCatalog = new HiveCatalog(hadoopConfiguration)
  val nameSpace = Namespace.of("db")
  val tableIdentifier = TableIdentifier.of(nameSpace, "tb")
  val columns: java.util.List[Types.NestedField] = new ArrayList[Types.NestedField]
  columns.add(Types.NestedField.of(1, true, "id", Types.IntegerType.get, "id doc"))
  columns.add(Types.NestedField.of(2, true, "ts", Types.TimestampType.withZone(), "ts doc"))
  val schema = new Schema(columns)
  val partition = PartitionSpec.builderFor(schema).year("ts").build()
  hiveCatalog.createTable(tableIdentifier, schema, partition)
}
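As the maintainers point out further down in this thread, engine.hive.enabled also needs to be added to the table properties at create time, not only to the Hadoop configuration. A minimal sketch of that variant, assuming the same hiveCatalog, schema, and partition values as in createByHiveCatalog above (the Iceberg Catalog API provides a createTable overload that accepts a properties map):

```scala
import java.util.{HashMap => JHashMap}
import org.apache.iceberg.catalog.{Namespace, TableIdentifier}

// Pass engine.hive.enabled=true as a table property at create time so the
// Hive storage-handler integration is enabled for this specific table.
// Assumes `hiveCatalog`, `schema`, and `partition` from the snippet above.
val props = new JHashMap[String, String]()
props.put("engine.hive.enabled", "true") // TableProperties.ENGINE_HIVE_ENABLED
hiveCatalog.createTable(
  TableIdentifier.of(Namespace.of("db"), "tb"),
  schema,
  partition,
  props)
```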
- Query iceberg table by hive client
0: jdbc:hive2://localhost:10000> add jar /Users/dovezhang/software/idea/github/iceberg/hive-runtime/build/libs/iceberg-hive-runtime-0.10.0.jar;
No rows affected (0.043 seconds)
0: jdbc:hive2://localhost:10000> set iceberg.mr.catalog=hive;
No rows affected (0.003 seconds)
0: jdbc:hive2://localhost:10000> select * from db.tb;
+--------+--------+
| tb.id | tb.ts |
+--------+--------+
+--------+--------+
No rows selected (1.166 seconds)
- Write data by hive Catalog using Spark
import java.sql.Timestamp
import org.apache.spark.sql.functions

case class dbtb(id: Int, ts: Timestamp)

def writeDataToIcebergHive(spark: SparkSession): Unit = {
  val seq = Seq(dbtb(1, Timestamp.valueOf("2020-07-06 13:40:00")),
    dbtb(2, Timestamp.valueOf("2020-07-06 14:30:00")),
    dbtb(3, Timestamp.valueOf("2020-07-06 15:20:00")))
  val df = spark.createDataFrame(seq).toDF("id", "ts")
  df.writeTo("hive_prod.db.tb").overwrite(functions.lit(true))
}
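Before involving the Hive client at all, one way to confirm that the write actually committed is to inspect the table's current snapshot through the Iceberg API. A hedged sketch, assuming the same hiveCatalog instance as in createByHiveCatalog above (loadTable and currentSnapshot are part of the Iceberg Catalog/Table API):

```scala
import org.apache.iceberg.catalog.{Namespace, TableIdentifier}

// Load the table through the Hive catalog and print the latest snapshot's
// summary (e.g. added-data-files, added-records) to confirm the Spark write.
// Assumes `hiveCatalog` from the createByHiveCatalog snippet above.
val table = hiveCatalog.loadTable(TableIdentifier.of(Namespace.of("db"), "tb"))
val snapshot = table.currentSnapshot()
if (snapshot != null) {
  println(snapshot.summary()) // summary map recorded by the commit
} else {
  println("no snapshot yet: the write did not commit")
}
```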
- Query iceberg table by hive client
0: jdbc:hive2://localhost:10000> select * from db.tb;
+--------+--------+
| tb.id | tb.ts |
+--------+--------+
+--------+--------+
No rows selected (0.152 seconds)
After writing the data, no rows are returned via the Hive client.
- Query iceberg table by hive catalog using Spark
def readIcebergByHiveCatalog(spark: SparkSession): Unit = {
spark.sql("select * from hive_prod.db.tb").show(false)
}
Result
+---+-------------------+
|id |ts |
+---+-------------------+
|1 |2020-07-06 13:40:00|
|2 |2020-07-06 14:30:00|
|3 |2020-07-06 15:20:00|
+---+-------------------+
- Check the table's data directory for data files
➜ bin hdfs dfs -ls /usr/hive/warehouse/db.db/tb/data/ts_year=2020
20/11/26 15:16:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 dovezhang supergroup 656 2020-11-26 15:11 /usr/hive/warehouse/db.db/tb/data/ts_year=2020/00000-0-2b98be41-8347-4a8c-a986-d28878ab7a67-00001.parquet
-rw-r--r-- 1 dovezhang supergroup 664 2020-11-26 15:11 /usr/hive/warehouse/db.db/tb/data/ts_year=2020/00001-1-b192e846-5a6a-4ee9-b31a-7e5fcf813b88-00001.parquet
I am not sure why the Hive client cannot see any data after Spark creates the table and writes to it. Does anyone know why?
Issue Analytics
- Created 3 years ago
- Comments: 12 (11 by maintainers)
The property engine.hive.enabled needs to be set to true and added to the table properties when creating the Iceberg table.
You can refer to the Using Hive Catalog section of this article: https://iceberg.apache.org/hive/#using-hive-catalog
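For a table that already exists, as in this issue, the same property can presumably be added after the fact instead of recreating the table. A sketch, assuming the spark session and hive_prod catalog configured at the top of the issue (ALTER TABLE ... SET TBLPROPERTIES is supported through Iceberg's Spark 3 catalog):

```scala
// Set engine.hive.enabled=true on the existing table so the Hive client
// can read it, then re-run the SELECT from Beeline.
// Assumes the `spark` session and `hive_prod` catalog from the issue setup.
spark.sql(
  "ALTER TABLE hive_prod.db.tb SET TBLPROPERTIES ('engine.hive.enabled'='true')")
```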
@zhangdove: Could you please share the HiveServer2 logs for the
SELECT * FROM db.tb
query? Maybe even DEBUG-level logs, if possible. Also, the results of the
DESCRIBE FORMATTED db.tb
command from Beeline might help? Thanks, Peter