
Deltastreamer - Global bloom index resulting in duplicates across partitions for the same record key

See original GitHub issue

Tips before filing an issue

  • Have you gone through our FAQs? Yes

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

We are using Deltastreamer to read from a Kafka topic and persist the data to an HDFS location. We enabled the GLOBAL_BLOOM index on the COPY_ON_WRITE storage type with UPSERT mode, since our requirement is to have no duplicates across partitions, but we are seeing duplicates for the same record key across partitions.

We tried two of the latest hudi-utilities bundle jars listed below, with the same result:

  • hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
  • hudi-utilities-bundle_2.11-0.5.2-SNAPSHOT.jar

To Reproduce

Steps to reproduce the behavior:

Expected behavior

No duplicates across partitions for the same record key.

Environment Description

  • Hudi version : 0.5.2

  • Spark version : 2.2.1

  • Hive version :

  • Hadoop version : 2.7

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : No

Additional context

Sample duplicate record (same record key under two partition paths):

scala> spark.sql("SELECT * FROM hudiTab where MBR_SYS_ID ='ABC:18:4587543:XZ:123456778'").show(5,false)

Row 1:
  _hoodie_commit_time    : 20200615211346
  _hoodie_commit_seqno   : 20200615211346_24_1166263
  _hoodie_record_key     : ABC:18:4587543:XZ:123456778
  _hoodie_partition_path : 2020-06-15
  _hoodie_file_name      : 790b3252-c4f7-46ed-ad86-98877541b89b-0_24-22-717_20200615211346.parquet
  MBR_SYS_ID             : ABC:18:4587543:XZ:123456778
  INDV_ID                : (empty)
  modifiedTs             : 1592268506762
  modifiedDt             : 2020-06-15
  indicators             : [[2018-11-01,9999-12-31,N,Y,N,N,N,N,Y,1,1,Y,Y]]

Row 2:
  _hoodie_commit_time    : 20200616161512
  _hoodie_commit_seqno   : 20200616161512_0_2586
  _hoodie_record_key     : ABC:18:4587543:XZ:123456778
  _hoodie_partition_path : 2020-06-16
  _hoodie_file_name      : 3a47934b-ffb7-41f5-a9a4-6b4f40519e79-0_0-22-696_20200616161512.parquet
  MBR_SYS_ID             : ABC:18:4587543:XZ:123456778
  INDV_ID                : (empty)
  modifiedTs             : 1592339194660
  modifiedDt             : 2020-06-16
  indicators             : [[2018-11-01,9999-12-31,N,Y,N,N,N,N,Y,1,1,Y,Y]]
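For anyone hitting the same symptom, a quick way to see how widespread the problem is would be to group on the Hudi metadata columns and list record keys that appear under more than one partition path. A minimal spark-shell sketch, assuming the table is registered as hudiTab as in the query above (the query itself is illustrative and not from the original issue):

scala> // record keys present under more than one partition path (should return no rows with a working global index)
scala> spark.sql("SELECT _hoodie_record_key, COUNT(DISTINCT _hoodie_partition_path) AS num_partitions, COUNT(*) AS num_rows FROM hudiTab GROUP BY _hoodie_record_key HAVING COUNT(DISTINCT _hoodie_partition_path) > 1").show(20, false)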

Property file:

hoodie.datasource.write.recordkey.field=MBR_SYS_ID
hoodie.datasource.write.partitionpath.field=modifiedDt
hoodie.datasource.write.precombine.field=modifiedTs
hoodie.index.type=GLOBAL_BLOOM
hoodie.bloom.index.update.partition.path=true
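# Clarifying note (not part of the original property file): with GLOBAL_BLOOM and
# hoodie.bloom.index.update.partition.path=true, an update arriving with a new
# partition path is expected to be removed from the old partition and written to
# the new one, so the same record key should not end up in two partitions.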
hoodie.auto.commit=false
enable.auto.commit=false
hoodie.deltastreamer.kafka.source.maxEvents=10000000

#client.id=kaas.prod.elr.edzprod.mcm.hudi.new.cow
group.id=kaas.prod.elr.edzprod.mcm.hudi.cow.dedup.full.load
bootstrap.servers=xyyyyyyz:443
metadata.broker.list=xyyyyyyz:443
auto.offset.reset=earliest
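The ingestion in this issue is driven by Deltastreamer with the property file above, but the same Hudi keys can also be passed as Spark datasource write options, which can be handy when reproducing the table layout from a spark-shell. A minimal sketch under stated assumptions: df is an already-prepared input batch and /tmp/hudi/hudiTab is a hypothetical base path, neither taken from the original issue.

import org.apache.spark.sql.DataFrame

// Upsert one incoming batch using the same keys as the property file above.
// "df", the table name and the base path are assumptions made for this sketch.
def upsertBatch(df: DataFrame): Unit = {
  df.write
    .format("org.apache.hudi")
    .option("hoodie.table.name", "hudiTab")                              // hypothetical table name
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "MBR_SYS_ID")
    .option("hoodie.datasource.write.partitionpath.field", "modifiedDt")
    .option("hoodie.datasource.write.precombine.field", "modifiedTs")
    .option("hoodie.index.type", "GLOBAL_BLOOM")
    .option("hoodie.bloom.index.update.partition.path", "true")
    .mode("append")                                                      // keep appending commits to the existing table
    .save("/tmp/hudi/hudiTab")                                           // hypothetical HDFS base path
}

This only illustrates how the configuration keys fit together; the actual job in this issue runs through the hudi-utilities DeltaStreamer with the property file shown above.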

Stacktrace

Our job runs every hour, reading from a Kafka topic and persisting into a Hudi dataset at an HDFS location using Deltastreamer. The dataset has about 117 million records in total, with 49K duplicate record keys split across 5 partitions.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (9 by maintainers)

Top GitHub Comments

mingujotemp commented, Aug 4, 2020 (0 reactions)

Can you link a PR for this issue? @vinothchandar

