Hive: create and write iceberg by hive catalog using Spark, Hive client read no data
See original GitHub issue.
- Environment
spark: 3.0.0
hive: 2.3.7
iceberg: 0.10.0
- SparkSession configuration
val spark = SparkSession
.builder()
.master("local[2]")
.appName("IcebergAPI")
.config("spark.sql.catalog.hive_prod", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.hive_prod.type", "hive")
.config("spark.sql.catalog.hive_prod.uri", "thrift://localhost:9083")
.enableHiveSupport()
.getOrCreate()
- Create database db by Hive client
➜ bin ./beeline
beeline> !connect jdbc:hive2://localhost:10000 hive hive
Connecting to jdbc:hive2://localhost:10000
Connected to: Apache Hive (version 2.3.7)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> create database db;
No rows affected (0.105 seconds)
- Create iceberg table by hiveCatalog using Spark (Link: https://iceberg.apache.org/hive/#using-hive-catalog)
import java.util.ArrayList
import org.apache.iceberg.{PartitionSpec, Schema}
import org.apache.iceberg.catalog.{Namespace, TableIdentifier}
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.types.Types

def createByHiveCatalog(spark: SparkSession): Unit = {
  val hadoopConfiguration = spark.sparkContext.hadoopConfiguration
  // iceberg.engine.hive.enabled=true
  hadoopConfiguration.set(org.apache.iceberg.hadoop.ConfigProperties.ENGINE_HIVE_ENABLED, "true")
  val hiveCatalog = new HiveCatalog(hadoopConfiguration)
  val nameSpace = Namespace.of("db")
  val tableIdentifier = TableIdentifier.of(nameSpace, "tb")
  val columns: java.util.List[Types.NestedField] = new ArrayList[Types.NestedField]
  columns.add(Types.NestedField.of(1, true, "id", Types.IntegerType.get, "id doc"))
  columns.add(Types.NestedField.of(2, true, "ts", Types.TimestampType.withZone(), "ts doc"))
  val schema = new Schema(columns)
  val partition = PartitionSpec.builderFor(schema).year("ts").build()
  hiveCatalog.createTable(tableIdentifier, schema, partition)
}
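As the maintainers point out further down in this thread, engine.hive.enabled also needs to be added to the table properties at create time, not only to the Hadoop configuration. A minimal sketch of that variant, assuming the same hiveCatalog, schema, and partition values as in createByHiveCatalog above (the Iceberg Catalog API provides a createTable overload that accepts a properties map):

```scala
import java.util.{HashMap => JHashMap}
import org.apache.iceberg.catalog.{Namespace, TableIdentifier}

// Pass engine.hive.enabled=true as a table property at create time so the
// Hive storage-handler integration is enabled for this specific table.
// Assumes `hiveCatalog`, `schema`, and `partition` from the snippet above.
val props = new JHashMap[String, String]()
props.put("engine.hive.enabled", "true") // TableProperties.ENGINE_HIVE_ENABLED
hiveCatalog.createTable(
  TableIdentifier.of(Namespace.of("db"), "tb"),
  schema,
  partition,
  props)
```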
- Query iceberg table by hive client
0: jdbc:hive2://localhost:10000> add jar /Users/dovezhang/software/idea/github/iceberg/hive-runtime/build/libs/iceberg-hive-runtime-0.10.0.jar;
No rows affected (0.043 seconds)
0: jdbc:hive2://localhost:10000> set iceberg.mr.catalog=hive;
No rows affected (0.003 seconds)
0: jdbc:hive2://localhost:10000> select * from db.tb;
+--------+--------+
| tb.id | tb.ts |
+--------+--------+
+--------+--------+
No rows selected (1.166 seconds)
- Write data by hive Catalog using Spark
import java.sql.Timestamp
import org.apache.spark.sql.functions

case class dbtb(id: Int, ts: Timestamp)

def writeDataToIcebergHive(spark: SparkSession): Unit = {
  val seq = Seq(dbtb(1, Timestamp.valueOf("2020-07-06 13:40:00")),
    dbtb(2, Timestamp.valueOf("2020-07-06 14:30:00")),
    dbtb(3, Timestamp.valueOf("2020-07-06 15:20:00")))
  val df = spark.createDataFrame(seq).toDF("id", "ts")
  df.writeTo("hive_prod.db.tb").overwrite(functions.lit(true))
}
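Before involving the Hive client at all, one way to confirm that the write actually committed is to inspect the table's current snapshot through the Iceberg API. A hedged sketch, assuming the same hiveCatalog instance as in createByHiveCatalog above (loadTable and currentSnapshot are part of the Iceberg Catalog/Table API):

```scala
import org.apache.iceberg.catalog.{Namespace, TableIdentifier}

// Load the table through the Hive catalog and print the latest snapshot's
// summary (e.g. added-data-files, added-records) to confirm the Spark write.
// Assumes `hiveCatalog` from the createByHiveCatalog snippet above.
val table = hiveCatalog.loadTable(TableIdentifier.of(Namespace.of("db"), "tb"))
val snapshot = table.currentSnapshot()
if (snapshot != null) {
  println(snapshot.summary()) // summary map recorded by the commit
} else {
  println("no snapshot yet: the write did not commit")
}
```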
- Query iceberg table by hive client
0: jdbc:hive2://localhost:10000> select * from db.tb;
+--------+--------+
| tb.id | tb.ts |
+--------+--------+
+--------+--------+
No rows selected (0.152 seconds)
After writing the data, no rows are returned via the Hive client.
- Query iceberg table by hive catalog using Spark
def readIcebergByHiveCatalog(spark: SparkSession): Unit = {
spark.sql("select * from hive_prod.db.tb").show(false)
}
Result
+---+-------------------+
|id |ts |
+---+-------------------+
|1 |2020-07-06 13:40:00|
|2 |2020-07-06 14:30:00|
|3 |2020-07-06 15:20:00|
+---+-------------------+
- Check the table's data directory for data files
➜ bin hdfs dfs -ls /usr/hive/warehouse/db.db/tb/data/ts_year=2020
20/11/26 15:16:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 dovezhang supergroup 656 2020-11-26 15:11 /usr/hive/warehouse/db.db/tb/data/ts_year=2020/00000-0-2b98be41-8347-4a8c-a986-d28878ab7a67-00001.parquet
-rw-r--r-- 1 dovezhang supergroup 664 2020-11-26 15:11 /usr/hive/warehouse/db.db/tb/data/ts_year=2020/00001-1-b192e846-5a6a-4ee9-b31a-7e5fcf813b88-00001.parquet
I am not sure why the Hive client cannot see any data after Spark creates the table and writes to it. Does anyone know why?
Issue Analytics
- Created 3 years ago
- Comments: 12 (11 by maintainers)
The property engine.hive.enabled needs to be set to true and added to the table properties when creating the Iceberg table.
You can refer to the Using Hive Catalog section of this article: https://iceberg.apache.org/hive/#using-hive-catalog
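For a table that already exists, as in this issue, the same property can presumably be added after the fact instead of recreating the table. A sketch, assuming the spark session and hive_prod catalog configured at the top of the issue (ALTER TABLE ... SET TBLPROPERTIES is supported through Iceberg's Spark 3 catalog):

```scala
// Set engine.hive.enabled=true on the existing table so the Hive client
// can read it, then re-run the SELECT from Beeline.
// Assumes the `spark` session and `hive_prod` catalog from the issue setup.
spark.sql(
  "ALTER TABLE hive_prod.db.tb SET TBLPROPERTIES ('engine.hive.enabled'='true')")
```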
@zhangdove: Could you please share the HiveServer2 logs for the
SELECT * FROM db.tb
query? Maybe even DEBUG-level logs, if possible. Also, the results of the
DESCRIBE FORMATTED db.tb
command from Beeline might help? Thanks, Peter