[SUPPORT] Hive Sync issues on deletes and non partitioned table
Tips before filing an issue
- Have you gone through our FAQs? Yes.
- Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org. I am part of the Slack groups and did not find a resolution there.
- If you have triaged this as a bug, then file an issue directly. I am not sure this is a bug, but we can check after analysis.
Describe the problem you faced
I am working on unit tests for some of the operations we perform, and I see my tests failing in the two scenarios below:
- The Hive table is not updated when a DELETE operation is called on the dataset.
- The Hive table is not updated (it comes back empty) when no partitions are specified.
Details on Issue 1:
I am trying to sync a Hive table on upsert (works fine) and on delete (does not work) in my unit tests. As per the Hudi "Writing Data" doc, we need to use the GlobalDeleteKeyGenerator class for deletes:
classOf[GlobalDeleteKeyGenerator].getCanonicalName (to be used when OPERATION_OPT_KEY is set to DELETE_OPERATION_OPT_VAL)
However, when I use this class I get the error:
Cause: java.lang.InstantiationException: org.apache.hudi.keygen.GlobalDeleteKeyGenerator
These are the hudi properties set on save:
(hoodie.datasource.hive_sync.database,default)
(hoodie.combine.before.insert,true)
(hoodie.embed.timeline.server,false)
(hoodie.insert.shuffle.parallelism,2)
(hoodie.datasource.write.precombine.field,timestamp)
(hoodie.datasource.hive_sync.partition_fields,partition)
(hoodie.datasource.hive_sync.use_jdbc,false)
(hoodie.datasource.hive_sync.partition_extractor_class,org.apache.hudi.keygen.GlobalDeleteKeyGenerator)
(hoodie.delete.shuffle.parallelism,2)
(hoodie.datasource.hive_sync.table,TestHudiTable)
(hoodie.index.type,GLOBAL_BLOOM)
(hoodie.datasource.write.operation,DELETE)
(hoodie.datasource.hive_sync.enable,true)
(hoodie.datasource.write.recordkey.field,id)
(hoodie.table.name,TestHudiTable)
(hoodie.datasource.write.table.type,COPY_ON_WRITE)
(hoodie.datasource.write.hive_style_partitioning,true)
(hoodie.upsert.shuffle.parallelism,2)
(hoodie.cleaner.commits.retained,15)
(hoodie.datasource.write.partitionpath.field,partition)
If I switch to the MultiPartKeysValueExtractor class, the deletes are not propagated to the Hive table. The Hudi read (spark.read.format("hudi").load(<BasePath>)) has the right data, with the id deleted, but spark.table(<TableName>) is inconsistent and still contains the id that was supposed to be deleted. For example, id1 is deleted in Hudi but still present in the Hive table:
spark.read.format("hudi").load(<BasePath>)
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                    |id |timestamp|partition|data |
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|20210528125210 |20210528125210_0_3 |id2 |partition=p1 |9b3aae86-e4ed-4bfa-b6cd-ee41db11ac15-0_0-23-26_20210528125210.parquet|id2|1 |p1 |data2|
|20210528125210 |20210528125210_1_1 |id3 |partition=p2 |a53dba79-90a2-4370-8637-e39301b4c10c-0_1-23-27_20210528125210.parquet|id3|3 |p2 |data3|
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
spark.table(<TableName>).show(false) // id1 is supposed to be deleted; the extractor class used is MultiPartKeysValueExtractor (GlobalDeleteKeyGenerator errors out).
+---+---------+-----+---------+
|id |timestamp|data |partition|
+---+---------+-----+---------+
|id1|1 |data1|p1 |
|id2|1 |data2|p1 |
|id3|3 |data3|p2 |
+---+---------+-----+---------+
To Reproduce
Steps to reproduce the behavior:
def getSnapshotData: Dataset[TestData] = {
  Seq(
    TestData(id = "id1", timestamp = 1, partition = "p1", data = "data1"),
    TestData(id = "id2", timestamp = 1, partition = "p1", data = "data2"),
    TestData(id = "id3", timestamp = 3, partition = "p2", data = "data3")
  ).toDS
}
val hudiOptions = Map(
  "hoodie.datasource.hive_sync.database" -> "default",
  "hoodie.combine.before.insert" -> "true",
  "hoodie.embed.timeline.server" -> "false",
  "hoodie.insert.shuffle.parallelism" -> "2",
  "hoodie.datasource.write.precombine.field" -> "timestamp",
  "hoodie.datasource.hive_sync.partition_fields" -> "partition",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.table" -> "TestHudiTable",
  "hoodie.index.type" -> "GLOBAL_BLOOM",
  "hoodie.datasource.write.operation" -> "UPSERT",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.table.name" -> "TestHudiTable",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.cleaner.commits.retained" -> "15",
  "hoodie.datasource.write.partitionpath.field" -> "partition"
)
snapshot.write.mode("overwrite").options(hudiOptions).format("hudi").save(<BasePath>)
val deleteDF = snapshot.filter("id = 'id1'")
val deleteHudiOptions = Map(
  "hoodie.datasource.hive_sync.database" -> "default",
  "hoodie.combine.before.insert" -> "true",
  "hoodie.embed.timeline.server" -> "false",
  "hoodie.insert.shuffle.parallelism" -> "2",
  "hoodie.datasource.write.precombine.field" -> "timestamp",
  "hoodie.datasource.hive_sync.partition_fields" -> "partition",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.keygen.GlobalDeleteKeyGenerator",
  "hoodie.delete.shuffle.parallelism" -> "2",
  "hoodie.datasource.hive_sync.table" -> "TestHudiTable",
  "hoodie.index.type" -> "GLOBAL_BLOOM",
  "hoodie.datasource.write.operation" -> "DELETE",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.table.name" -> "TestHudiTable",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.cleaner.commits.retained" -> "15",
  "hoodie.datasource.write.partitionpath.field" -> "partition"
)
deleteDF.write.mode("append").options(deleteHudiOptions).format("hudi").save(<BasePath>)
Expected behavior
Expecting the records to be deleted and the change synced to Hive.
Environment Description
Hudi version : 0.8
Spark version : 2.4
Hive version : 2.5
Hadoop version : 2.5
Storage (HDFS/S3/GCS…) : Local testing. Production datasets in S3.
Running on Docker? (yes/no) : no
Stacktrace
- should hard delete records from hudi table with hive sync *** FAILED *** (24 seconds, 49 milliseconds)
Cause: java.lang.NoSuchMethodException: org.apache.hudi.keygen.GlobalDeleteKeyGenerator.<init>()
[scalatest] at java.lang.Class.getConstructor0(Class.java:3110)
[scalatest] at java.lang.Class.newInstance(Class.java:412)
[scalatest] at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:98)
[scalatest] at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:69)
[scalatest] at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:391)
[scalatest] at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:440)
[scalatest] at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:436)
[scalatest] at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
[scalatest] at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:436)
[scalatest] at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:497)
[scalatest] at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:222)
[scalatest] at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
[scalatest] at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
[scalatest] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
[scalatest] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
[scalatest] at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
[scalatest] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
[scalatest] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
[scalatest] at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
[scalatest] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[scalatest] at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
[scalatest] at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
[scalatest] at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
[scalatest] at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
[scalatest] at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
[scalatest] at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
[scalatest] at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
[scalatest] at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
[scalatest] at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
[scalatest] at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
[scalatest] at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
[scalatest] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
[scalatest] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
[scalatest] at com.amazon.sm.hudi.framework.common.TestUtils.writeHudiData(TestUtils.scala:148)
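Reading the trace, HoodieHiveClient tries to instantiate the configured partition extractor with a no-arg constructor, and GlobalDeleteKeyGenerator is a key generator, not a PartitionValueExtractor. This is my assumption rather than a confirmed diagnosis, but the two properties may simply be crossed in the config above. A sketch of how the delete options might split them (the property names are the ones already used above; the pairing is guessed):

```scala
// Hypothetical split (assumption, not a confirmed fix): the key generator
// drives the DELETE write path, while Hive sync needs a separate
// PartitionValueExtractor implementation.
val deleteOpts = Map(
  // Key generation for the delete write:
  "hoodie.datasource.write.keygenerator.class" ->
    "org.apache.hudi.keygen.GlobalDeleteKeyGenerator",
  // Hive sync extractor stays a PartitionValueExtractor, not a key generator:
  "hoodie.datasource.hive_sync.partition_extractor_class" ->
    "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.write.operation" -> "DELETE"
)
```

With this split, the class passed to Hive sync has the no-arg constructor HoodieHiveClient expects, which would avoid the InstantiationException/NoSuchMethodException seen above.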
Details on Issue 2:
I am trying to sync a Hive table on upsert in my unit tests. The test was for a non-partitioned table. As per the Hudi docs on non-partitioned tables, we need to set the properties below:
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
However, even after setting the above properties, the Hive table shows empty records. I found a related GitHub issue, but it concerns a complex record key for a non-partitioned table. Support for complex keys in non-partitioned tables is also required for us, and I will track that issue for the support.
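For clarity, the documented pairing can be restated as a minimal options fragment (a sketch only; the remaining write options from the full list below are unchanged):

```scala
// Minimal sketch of the documented non-partitioned pairing: the
// NonpartitionedKeyGenerator on the write side is matched by the
// NonPartitionedExtractor on the Hive sync side.
val nonPartitionedOpts = Map(
  "hoodie.datasource.write.keygenerator.class" ->
    "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
  "hoodie.datasource.hive_sync.partition_extractor_class" ->
    "org.apache.hudi.hive.NonPartitionedExtractor",
  "hoodie.datasource.hive_sync.enable" -> "true"
)
```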
These are the hudi properties set on save:
(hoodie.datasource.hive_sync.database,default)
(hoodie.combine.before.insert,true)
(hoodie.embed.timeline.server,false)
(hoodie.insert.shuffle.parallelism,2)
(hoodie.datasource.write.precombine.field,timestamp)
(hoodie.datasource.hive_sync.use_jdbc,false)
(hoodie.datasource.hive_sync.partition_extractor_class,org.apache.hudi.hive.NonPartitionedExtractor)
(hoodie.datasource.hive_sync.table,TestHudiTable)
(hoodie.index.type,GLOBAL_BLOOM)
(hoodie.datasource.write.operation,UPSERT)
(hoodie.datasource.hive_sync.enable,true)
(hoodie.datasource.write.recordkey.field,id)
(hoodie.table.name,TestHudiTable)
(hoodie.datasource.write.table.type,COPY_ON_WRITE)
(hoodie.datasource.write.hive_style_partitioning,true)
(hoodie.datasource.write.keygenerator.class,org.apache.hudi.keygen.NonpartitionedKeyGenerator)
(hoodie.upsert.shuffle.parallelism,2)
(hoodie.cleaner.commits.retained,15)
Results:
spark.read.format("hudi").load(<BasePath>)
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name |id |timestamp|partition|data |
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|20210528130615 |20210528130615_0_1 |id1 | |d739bf7d-9152-4565-ab01-9bf3e96d9997-0_0-29-31_20210528130615.parquet|id1|1 |p1 |data1|
|20210528130615 |20210528130615_0_2 |id3 | |d739bf7d-9152-4565-ab01-9bf3e96d9997-0_0-29-31_20210528130615.parquet|id3|3 |p2 |data3|
|20210528130615 |20210528130615_0_3 |id2 | |d739bf7d-9152-4565-ab01-9bf3e96d9997-0_0-29-31_20210528130615.parquet|id2|1 |p1 |data2|
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
spark.table(<TableName>).show(false)
+---+---------+---------+----+
|id |timestamp|partition|data|
+---+---------+---------+----+
+---+---------+---------+----+
To Reproduce
Steps to reproduce the behavior:
def getSnapshotData: Dataset[TestData] = {
  Seq(
    TestData(id = "id1", timestamp = 1, partition = "p1", data = "data1"),
    TestData(id = "id2", timestamp = 1, partition = "p1", data = "data2"),
    TestData(id = "id3", timestamp = 3, partition = "p2", data = "data3")
  ).toDS
}
val hudiOptions = Map(
  "hoodie.datasource.hive_sync.database" -> "default",
  "hoodie.combine.before.insert" -> "true",
  "hoodie.embed.timeline.server" -> "false",
  "hoodie.insert.shuffle.parallelism" -> "2",
  "hoodie.datasource.write.precombine.field" -> "timestamp",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.NonPartitionedExtractor",
  "hoodie.datasource.hive_sync.table" -> "TestHudiTable",
  "hoodie.index.type" -> "GLOBAL_BLOOM",
  "hoodie.datasource.write.operation" -> "UPSERT",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.table.name" -> "TestHudiTable",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.cleaner.commits.retained" -> "15"
)
snapshot.write.mode("overwrite").options(hudiOptions).format("hudi").save(<BasePath>)
Expected behavior
Expecting to see the records in the Hive table for the non-partitioned dataset.
Environment Description
Hudi version : 0.8
Spark version : 2.4
Hive version : 2.5
Hadoop version : 2.5
Storage (HDFS/S3/GCS…) : Local testing. Production datasets in S3.
Running on Docker? (yes/no) : no
Issue Analytics
- Created 2 years ago
- Comments: 6 (2 by maintainers)
Top GitHub Comments
The second problem still exists. Setting the non-partitioned properties above (NonpartitionedKeyGenerator / NonPartitionedExtractor) for non-partitioned data doesn't show data after syncing to Glue. Querying the data returns empty results, while the Parquet file has all the columns. The schema in Glue shows "Partition(0)" against the column it was synced with, but probably shows no data because all partitions are empty. If I edit the Glue table schema and remove the partition field, the data starts showing up correctly.
The workaround for me was to use org.apache.hudi.keygen.SimpleKeyGenerator with a constant column. (I was using Debezium, so I had a db_shard_source_partition column to partition with.)

@pranotishanbhag why do you have to use GlobalDeleteKeyGenerator to delete records? For me, I use the same key for INSERT/UPDATE/DELETE for a table.
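The workaround described in the comment above can be sketched as an options fragment. This is my reading of the comment, not a verified configuration; the db_shard_source_partition column name comes from the Debezium setup mentioned there, and any constant column should work the same way:

```scala
// Workaround sketch (assumption based on the comment above): instead of the
// non-partitioned key generator, use SimpleKeyGenerator with a constant
// partition column, so Glue/Hive sees one real, non-empty partition.
val workaroundOpts = Map(
  "hoodie.datasource.write.keygenerator.class" ->
    "org.apache.hudi.keygen.SimpleKeyGenerator",
  // The same constant column is used on the write side and the sync side:
  "hoodie.datasource.write.partitionpath.field" -> "db_shard_source_partition",
  "hoodie.datasource.hive_sync.partition_fields" -> "db_shard_source_partition",
  "hoodie.datasource.hive_sync.partition_extractor_class" ->
    "org.apache.hudi.hive.MultiPartKeysValueExtractor"
)
```

The design idea is that a table with one constant partition behaves like a non-partitioned table for queries, while avoiding the empty-partition state that makes Glue return no rows.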