
[SUPPORT] Hive Sync issues on deletes and non partitioned table


Tips before filing an issue

  • Have you gone through our FAQs? YES

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org. I am part of the Slack groups but did not find a resolution there.

  • If you have triaged this as a bug, then file an issue directly. I am not sure this is a bug, but we can confirm after the analysis.

Describe the problem you faced

I am writing unit tests for some of the operations we perform, and I see the tests failing in the two scenarios below:

  1. The Hive table is not updated when the DELETE operation is called on the dataset.
  2. The Hive table is not updated (it comes back empty) when no partitions are specified (non-partitioned table).

Details on Issue 1:

I am trying to sync a Hive table on upsert (works fine) and on delete (does not work) in my unit tests. As per the Hudi Writing Data doc, we need to use the GlobalDeleteKeyGenerator class for deletes:

classOf[GlobalDeleteKeyGenerator].getCanonicalName (to be used when OPERATION_OPT_KEY is set to DELETE_OPERATION_OPT_VAL)

However, when I use this class I get the error:

“Cause: java.lang.InstantiationException: org.apache.hudi.keygen.GlobalDeleteKeyGenerator”
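
For context, the key generator and the Hive sync partition extractor are configured through two separate options. A minimal sketch of how they are typically set side by side (hedged: the option names are the same ones that appear in the property dumps in this report, and the extractor choice is only an assumption for a single hive-style partition field):

// Sketch only, not a confirmed fix: the key generator and the Hive sync
// partition value extractor are set through different options.
val deleteOptions = Map(
  "hoodie.datasource.write.operation" -> "DELETE",
  // key generator used on write (per the doc snippet quoted above)
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.GlobalDeleteKeyGenerator",
  // Hive sync extractor (assumption: single hive-style partition field)
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor"
)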

These are the hudi properties set on save:

(hoodie.datasource.hive_sync.database,default)
(hoodie.combine.before.insert,true)
(hoodie.embed.timeline.server,false)
(hoodie.insert.shuffle.parallelism,2)
(hoodie.datasource.write.precombine.field,timestamp)
(hoodie.datasource.hive_sync.partition_fields,partition)
(hoodie.datasource.hive_sync.use_jdbc,false)
(hoodie.datasource.hive_sync.partition_extractor_class,org.apache.hudi.keygen.GlobalDeleteKeyGenerator)
(hoodie.delete.shuffle.parallelism,2)
(hoodie.datasource.hive_sync.table,TestHudiTable)
(hoodie.index.type,GLOBAL_BLOOM)
(hoodie.datasource.write.operation,DELETE)
(hoodie.datasource.hive_sync.enable,true)
(hoodie.datasource.write.recordkey.field,id)
(hoodie.table.name,TestHudiTable)
(hoodie.datasource.write.table.type,COPY_ON_WRITE)
(hoodie.datasource.write.hive_style_partitioning,true)
(hoodie.upsert.shuffle.parallelism,2)
(hoodie.cleaner.commits.retained,15)
(hoodie.datasource.write.partitionpath.field,partition)

If I switch to the MultiPartKeysValueExtractor class, the write succeeds but the deletes are not propagated to the Hive table. The Hudi read, spark.read.format("hudi").load(<BasePath>), has the right data (the id is deleted), but spark.table(<TableName>) is not consistent and still contains the id that was supposed to be deleted. For example, id1 is deleted in Hudi but is still in the Hive table:

spark.read.format("hudi").load(<BasePath>)
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                    |id |timestamp|partition|data |
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|20210528125210     |20210528125210_0_3  |id2               |partition=p1          |9b3aae86-e4ed-4bfa-b6cd-ee41db11ac15-0_0-23-26_20210528125210.parquet|id2|1        |p1       |data2|
|20210528125210     |20210528125210_1_1  |id3               |partition=p2          |a53dba79-90a2-4370-8637-e39301b4c10c-0_1-23-27_20210528125210.parquet|id3|3        |p2       |data3|
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+

spark.table(<TableName>).show(false) // id1 was supposed to be deleted; the class used here is MultiPartKeysValueExtractor (GlobalDeleteKeyGenerator gives the error above).
+---+---------+-----+---------+
|id |timestamp|data |partition|
+---+---------+-----+---------+
|id1|1        |data1|p1       |
|id2|1        |data2|p1       |
|id3|3        |data3|p2       |
+---+---------+-----+---------+

To Reproduce

Steps to reproduce the behavior:

  def getSnapshotData: Dataset[TestData] = {
    Seq(
      TestData(id = "id1", timestamp = 1, partition = "p1", data = "data1"),
      TestData(id = "id2", timestamp = 1, partition = "p1", data = "data2"),
      TestData(id = "id3", timestamp = 3, partition = "p2", data = "data3")
    ).toDS
  }
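
The snippet above relies on a TestData case class and the Spark implicits needed for .toDS; a hypothetical version of those definitions (not in the original report, field types inferred from the sample data) would be:

// Hypothetical definitions implied by getSnapshotData above (not part of the original report).
case class TestData(id: String, timestamp: Long, partition: String, data: String)

// .toDS on a Seq needs the SparkSession implicits in scope.
import spark.implicits._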

val hudiOptions = Map(
  "hoodie.datasource.hive_sync.database" -> "default",
  "hoodie.combine.before.insert" -> "true",
  "hoodie.embed.timeline.server" -> "false",
  "hoodie.insert.shuffle.parallelism" -> "2",
  "hoodie.datasource.write.precombine.field" -> "timestamp",
  "hoodie.datasource.hive_sync.partition_fields" -> "partition",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.table" -> "TestHudiTable",
  "hoodie.index.type" -> "GLOBAL_BLOOM",
  "hoodie.datasource.write.operation" -> "UPSERT",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.table.name" -> "TestHudiTable",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.cleaner.commits.retained" -> "15",
  "hoodie.datasource.write.partitionpath.field" -> "partition"
)

snapshot.write.mode("overwrite").options(hudiOptions).format("hudi").save(<BasePath>)

val deleteDF = snapshot.filter("id = 'id1'")

val deleteHudiOptions = Map(
  "hoodie.datasource.hive_sync.database" -> "default",
  "hoodie.combine.before.insert" -> "true",
  "hoodie.embed.timeline.server" -> "false",
  "hoodie.insert.shuffle.parallelism" -> "2",
  "hoodie.datasource.write.precombine.field" -> "timestamp",
  "hoodie.datasource.hive_sync.partition_fields" -> "partition",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.keygen.GlobalDeleteKeyGenerator",
  "hoodie.delete.shuffle.parallelism" -> "2",
  "hoodie.datasource.hive_sync.table" -> "TestHudiTable",
  "hoodie.index.type" -> "GLOBAL_BLOOM",
  "hoodie.datasource.write.operation" -> "DELETE",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.table.name" -> "TestHudiTable",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.cleaner.commits.retained" -> "15",
  "hoodie.datasource.write.partitionpath.field" -> "partition"
)

deleteDF.write.mode("append").options(deleteHudiOptions).format("hudi").save(<BasePath>)
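
A rough sketch of the check that exposes the mismatch described above (hypothetical test code; the table name and base path placeholder are the ones already used in this report):

// Hypothetical verification: after the delete commit, both views should agree on the surviving ids.
import spark.implicits._
val hudiIds = spark.read.format("hudi").load(<BasePath>).select("id").as[String].collect().toSet
val hiveIds = spark.table("TestHudiTable").select("id").as[String].collect().toSet
assert(!hudiIds.contains("id1")) // passes: the Hudi view reflects the delete
assert(hudiIds == hiveIds)       // fails here: the Hive table still returns id1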

Expected behavior

Expecting the records to be deleted and the change to be synced to Hive.

Environment Description

Hudi version : 0.8

Spark version : 2.4

Hive version : 2.5

Hadoop version : 2.5

Storage (HDFS/S3/GCS…) : Local testing. Production datasets in S3.

Running on Docker? (yes/no) : no

Stacktrace

 - should hard delete records from hudi table with hive sync *** FAILED *** (24 seconds, 49 milliseconds)
Cause: java.lang.NoSuchMethodException: org.apache.hudi.keygen.GlobalDeleteKeyGenerator.<init>()
[scalatest]   at java.lang.Class.getConstructor0(Class.java:3110)
[scalatest]   at java.lang.Class.newInstance(Class.java:412)
[scalatest]   at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:98)
[scalatest]   at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:69)
[scalatest]   at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:391)
[scalatest]   at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:440)
[scalatest]   at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:436)
[scalatest]   at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
[scalatest]   at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:436)
[scalatest]   at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:497)
[scalatest]   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:222)
[scalatest]   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
[scalatest]   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
[scalatest]   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
[scalatest]   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
[scalatest]   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
[scalatest]   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
[scalatest]   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
[scalatest]   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
[scalatest]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[scalatest]   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
[scalatest]   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
[scalatest]   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
[scalatest]   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
[scalatest]   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
[scalatest]   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
[scalatest]   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
[scalatest]   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
[scalatest]   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
[scalatest]   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
[scalatest]   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
[scalatest]   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
[scalatest]   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
[scalatest]   at com.amazon.sm.hudi.framework.common.TestUtils.writeHudiData(TestUtils.scala:148)

Details on Issue 2:

I am trying to sync a Hive table on upsert in my unit tests; this test is for a non-partitioned table. As per the Hudi non-partitioned tables doc, we need to set the properties below for non-partitioned tables:

hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor

However, even after setting the above properties, the Hive table comes back empty. I found a related GitHub issue, but it concerns complex record keys for non-partitioned tables. Support for complex keys in non-partitioned tables is also required for us, and I will track that issue for the support.

These are the hudi properties set on save:

(hoodie.datasource.hive_sync.database,default)
(hoodie.combine.before.insert,true)
(hoodie.embed.timeline.server,false)
(hoodie.insert.shuffle.parallelism,2)
(hoodie.datasource.write.precombine.field,timestamp)
(hoodie.datasource.hive_sync.use_jdbc,false)
(hoodie.datasource.hive_sync.partition_extractor_class,org.apache.hudi.hive.NonPartitionedExtractor)
(hoodie.datasource.hive_sync.table,TestHudiTable)
(hoodie.index.type,GLOBAL_BLOOM)
(hoodie.datasource.write.operation,UPSERT)
(hoodie.datasource.hive_sync.enable,true)
(hoodie.datasource.write.recordkey.field,id)
(hoodie.table.name,TestHudiTable)
(hoodie.datasource.write.table.type,COPY_ON_WRITE)
(hoodie.datasource.write.hive_style_partitioning,true)
(hoodie.datasource.write.keygenerator.class,org.apache.hudi.keygen.NonpartitionedKeyGenerator)
(hoodie.upsert.shuffle.parallelism,2)
(hoodie.cleaner.commits.retained,15)

Results:

spark.read.format("hudi").load(<BasePath>)
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                    |id |timestamp|partition|data |
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|20210528130615     |20210528130615_0_1  |id1               |                      |d739bf7d-9152-4565-ab01-9bf3e96d9997-0_0-29-31_20210528130615.parquet|id1|1        |p1       |data1|
|20210528130615     |20210528130615_0_2  |id3               |                      |d739bf7d-9152-4565-ab01-9bf3e96d9997-0_0-29-31_20210528130615.parquet|id3|3        |p2       |data3|
|20210528130615     |20210528130615_0_3  |id2               |                      |d739bf7d-9152-4565-ab01-9bf3e96d9997-0_0-29-31_20210528130615.parquet|id2|1        |p1       |data2|
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+

spark.table(<TableName>).show(false) 
+---+---------+---------+----+
|id |timestamp|partition|data|
+---+---------+---------+----+
+---+---------+---------+----+

To Reproduce

Steps to reproduce the behavior:

  def getSnapshotData: Dataset[TestData] = {
    Seq(
      TestData(id = "id1", timestamp = 1, partition = "p1", data = "data1"),
      TestData(id = "id2", timestamp = 1, partition = "p1", data = "data2"),
      TestData(id = "id3", timestamp = 3, partition = "p2", data = "data3")
    ).toDS
  }

val hudiOptions = Map(
  "hoodie.datasource.hive_sync.database" -> "default",
  "hoodie.combine.before.insert" -> "true",
  "hoodie.embed.timeline.server" -> "false",
  "hoodie.insert.shuffle.parallelism" -> "2",
  "hoodie.datasource.write.precombine.field" -> "timestamp",
  "hoodie.datasource.hive_sync.use_jdbc" -> "false",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.NonPartitionedExtractor",
  "hoodie.datasource.hive_sync.table" -> "TestHudiTable",
  "hoodie.index.type" -> "GLOBAL_BLOOM",
  "hoodie.datasource.write.operation" -> "UPSERT",
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.table.name" -> "TestHudiTable",
  "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.cleaner.commits.retained" -> "15"
)

snapshot.write.mode("overwrite").options(hudiOptions).format("hudi").save(<BasePath>)

Expected behavior

Expecting to see the records in the Hive table for the non-partitioned dataset.

Environment Description

Hudi version : 0.8

Spark version : 2.4

Hive version : 2.5

Hadoop version : 2.5

Storage (HDFS/S3/GCS…) : Local testing. Production datasets in S3.

Running on Docker? (yes/no) : no

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
nverdhan commented on Aug 31, 2022

The second problem still exists.

Setting

hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor

for non-partitioned data doesn’t show data after syncing to Glue. Querying the table returns empty results even though the Parquet files have all the columns. The schema in Glue shows “Partition (0)” against the column it was synced on, but it probably returns no data because all the partitions are empty. If I edit the Glue table schema and remove the partition field, the data starts showing up correctly.

The workaround for me was to use org.apache.hudi.keygen.SimpleKeyGenerator with a constant column. (I was using Debezium, so I had a db_shard_source_partition column to partition with.)
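
A rough sketch of that workaround (hedged: "const_partition" and the constant value are hypothetical stand-ins for the Debezium column mentioned above, and the input DataFrame is the snapshot from the repro earlier in this report):

// Sketch of the workaround: partition on a constant column instead of using a non-partitioned table.
import org.apache.spark.sql.functions.lit

val withConstPartition = snapshot.withColumn("const_partition", lit("default"))

val workaroundOptions = Map(
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.SimpleKeyGenerator",
  "hoodie.datasource.write.partitionpath.field" -> "const_partition",
  "hoodie.datasource.hive_sync.partition_fields" -> "const_partition",
  "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor"
)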

0 reactions
Gatsby-Lee commented on Nov 22, 2022

@pranotishanbhag why do you have to use "GlobalDeleteKeyGenerator" to delete records? For me, I use the same key generator for INSERT/UPDATE/DELETE on a table.
