
Hive: create and write iceberg by hive catalog using Spark, Hive client read no data

See original GitHub issue
1. Environment
spark: 3.0.0
hive: 2.3.7
iceberg: 0.10.0
2. SparkSession configuration
    val spark = SparkSession
      .builder()
      .master("local[2]")
      .appName("IcebergAPI")
      .config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.hive_prod.type", "hive")
      .config("spark.sql.catalog.hive_prod.uri", "thrift://localhost:9083")
      .enableHiveSupport()
      .getOrCreate()
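
For reference, the same catalog can be configured statically instead of through the builder (a sketch assuming the same `hive_prod` catalog name; these properties map one-to-one to the `.config(...)` calls above, per the Iceberg Spark configuration docs):

```
# spark-defaults.conf
spark.sql.catalog.hive_prod        org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_prod.type   hive
spark.sql.catalog.hive_prod.uri    thrift://localhost:9083
```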
3. Create database db by Hive client
➜  bin ./beeline
beeline> !connect jdbc:hive2://localhost:10000 hive hive
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 2.3.7)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> create database db;
No rows affected (0.105 seconds)
4. Create Iceberg table by HiveCatalog using Spark (see https://iceberg.apache.org/hive/#using-hive-catalog)
  import java.util.ArrayList
  import org.apache.iceberg.{PartitionSpec, Schema}
  import org.apache.iceberg.catalog.{Namespace, TableIdentifier}
  import org.apache.iceberg.hive.HiveCatalog
  import org.apache.iceberg.types.Types

  def createByHiveCatalog(spark: SparkSession): Unit = {
    val hadoopConfiguration = spark.sparkContext.hadoopConfiguration
    // iceberg.engine.hive.enabled=true
    hadoopConfiguration.set(org.apache.iceberg.hadoop.ConfigProperties.ENGINE_HIVE_ENABLED, "true")
    val hiveCatalog = new HiveCatalog(hadoopConfiguration)
    val nameSpace = Namespace.of("db")
    val tableIdentifier: TableIdentifier = TableIdentifier.of(nameSpace, "tb")

    // java.util.List is required here: Iceberg's Schema constructor takes a Java list
    val columns: java.util.List[Types.NestedField] = new ArrayList[Types.NestedField]
    columns.add(Types.NestedField.of(1, true, "id", Types.IntegerType.get, "id doc"))
    columns.add(Types.NestedField.of(2, true, "ts", Types.TimestampType.withZone(), "ts doc"))

    val schema: Schema = new Schema(columns)
    val partition = PartitionSpec.builderFor(schema).year("ts").build()

    hiveCatalog.createTable(tableIdentifier, schema, partition)
  }
5. Query Iceberg table by Hive client
0: jdbc:hive2://localhost:10000> add jar /Users/dovezhang/software/idea/github/iceberg/hive-runtime/build/libs/iceberg-hive-runtime-0.10.0.jar;
No rows affected (0.043 seconds)
0: jdbc:hive2://localhost:10000> set iceberg.mr.catalog=hive;
No rows affected (0.003 seconds)
0: jdbc:hive2://localhost:10000> select * from db.tb;
+--------+--------+
| tb.id  | tb.ts  |
+--------+--------+
+--------+--------+
No rows selected (1.166 seconds)
6. Write data by HiveCatalog using Spark
  import java.sql.Timestamp
  import org.apache.spark.sql.functions

  case class dbtb(id: Int, time: Timestamp)

  def writeDataToIcebergHive(spark: SparkSession): Unit = {
    val seq = Seq(dbtb(1, Timestamp.valueOf("2020-07-06 13:40:00")),
      dbtb(2, Timestamp.valueOf("2020-07-06 14:30:00")),
      dbtb(3, Timestamp.valueOf("2020-07-06 15:20:00")))
    // rename the case-class field "time" to the table column name "ts"
    val df = spark.createDataFrame(seq).toDF("id", "ts")

    df.writeTo("hive_prod.db.tb").overwrite(functions.lit(true))
  }
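
The same overwrite can also be expressed in Spark SQL against the v2 catalog (a sketch, assuming the `hive_prod` catalog configured in step 2):

```
-- Overwrite db.tb through the hive_prod Iceberg catalog
INSERT OVERWRITE hive_prod.db.tb
VALUES (1, TIMESTAMP '2020-07-06 13:40:00'),
       (2, TIMESTAMP '2020-07-06 14:30:00'),
       (3, TIMESTAMP '2020-07-06 15:20:00');
```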
7. Query Iceberg table again by Hive client
0: jdbc:hive2://localhost:10000> select * from db.tb;
+--------+--------+
| tb.id  | tb.ts  |
+--------+--------+
+--------+--------+
No rows selected (0.152 seconds)

After writing the data, the Hive client still returns no rows.

8. Query Iceberg table by Hive catalog using Spark
  def readIcebergByHiveCatalog(spark: SparkSession): Unit = {
    spark.sql("select * from hive_prod.db.tb").show(false)
  }

Result

+---+-------------------+
|id |ts                 |
+---+-------------------+
|1  |2020-07-06 13:40:00|
|2  |2020-07-06 14:30:00|
|3  |2020-07-06 15:20:00|
+---+-------------------+
9. List the data files on HDFS
➜  bin hdfs dfs -ls /usr/hive/warehouse/db.db/tb/data/ts_year=2020
20/11/26 15:16:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 dovezhang supergroup        656 2020-11-26 15:11 /usr/hive/warehouse/db.db/tb/data/ts_year=2020/00000-0-2b98be41-8347-4a8c-a986-d28878ab7a67-00001.parquet
-rw-r--r--   1 dovezhang supergroup        664 2020-11-26 15:11 /usr/hive/warehouse/db.db/tb/data/ts_year=2020/00001-1-b192e846-5a6a-4ee9-b31a-7e5fcf813b88-00001.parquet

I am not sure why the Hive client cannot see any data after Spark creates the table and writes the data. Does anyone know why?

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Comments:12 (11 by maintainers)

Top GitHub Comments

1 reaction · GintokiYs commented, Dec 9, 2020

The property engine.hive.enabled needs to be set to true and added to the table properties when creating the Iceberg table.

Map<String, String> tableProperties = new HashMap<String, String>(); 
tableProperties.put(TableProperties.ENGINE_HIVE_ENABLED, "true"); //engine.hive.enabled=true 
catalog.createTable(tableId, schema, spec, tableProperties);

You can refer to the Using Hive Catalog section of the Iceberg docs: https://iceberg.apache.org/hive/#using-hive-catalog
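
Adapting the reporter's Scala method along these lines (a sketch; the four-argument `createTable(identifier, schema, spec, properties)` overload is part of the Iceberg `Catalog` interface, and `TableProperties.ENGINE_HIVE_ENABLED` is the constant used in the comment above):

```
import java.util.{HashMap => JHashMap}
import org.apache.iceberg.TableProperties

// Set engine.hive.enabled=true as a *table* property at creation time,
// in addition to the iceberg.engine.hive.enabled Hadoop conf setting
val tableProps = new JHashMap[String, String]()
tableProps.put(TableProperties.ENGINE_HIVE_ENABLED, "true") // engine.hive.enabled=true

hiveCatalog.createTable(tableIdentifier, schema, partition, tableProps)
```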

1 reaction · pvary commented, Nov 26, 2020

@zhangdove: Could you please share the HiveServer2 logs for the SELECT * FROM db.tb query? DEBUG-level logs would be even better, if possible. The output of DESCRIBE FORMATTED db.tb from BeeLine might also help. Thanks, Peter
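
The commands to gather that information in BeeLine (exact output varies by Hive and Iceberg version, so none is shown here):

```
-- Inspect how the table is registered in the metastore;
-- check the storage handler / SerDe lines in the output
DESCRIBE FORMATTED db.tb;

-- Verify whether engine.hive.enabled was stored as a table property
SHOW TBLPROPERTIES db.tb;
```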
