Deltastreamer: GLOBAL_BLOOM index resulting in duplicates across partitions for the same record key
See original GitHub issue

Tips before filing an issue
- Have you gone through our FAQs? Yes
- Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced

We are using Deltastreamer to read from a Kafka topic and persist the data into an HDFS location. We enabled the GLOBAL_BLOOM index on the COPY_ON_WRITE storage type in UPSERT mode, since our requirement is to have no duplicates across partitions, but we are seeing duplicates for the same record key across partitions.

We tried two of the latest hudi-utilities jars, listed below, with the same result:
- hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
- hudi-utilities-bundle_2.11-0.5.2-SNAPSHOT.jar
To Reproduce
Steps to reproduce the behavior:
Expected behavior
No duplicates across partitions for the same record key.
Environment Description
- Hudi version : 0.5.2
- Spark version : 2.2.1
- Hive version :
- Hadoop version : 2.7
- Storage (HDFS/S3/GCS…) : HDFS
- Running on Docker? (yes/no) : No
Additional context
Sample duplicate record:

```
scala> spark.sql("SELECT * FROM hudiTab WHERE MBR_SYS_ID = 'ABC:18:4587543:XZ:123456778'").show(5, false)
+-------------------+-------------------------+---------------------------+----------------------+-----------------------------------------------------------------------+---------------------------+-------+-------------+----------+-----------------------------------------------+
|_hoodie_commit_time|_hoodie_commit_seqno     |_hoodie_record_key         |_hoodie_partition_path|_hoodie_file_name                                                      |MBR_SYS_ID                 |INDV_ID|modifiedTs   |modifiedDt|indicators                                     |
+-------------------+-------------------------+---------------------------+----------------------+-----------------------------------------------------------------------+---------------------------+-------+-------------+----------+-----------------------------------------------+
|20200615211346     |20200615211346_24_1166263|ABC:18:4587543:XZ:123456778|2020-06-15            |790b3252-c4f7-46ed-ad86-98877541b89b-0_24-22-717_20200615211346.parquet|ABC:18:4587543:XZ:123456778|       |1592268506762|2020-06-15|[[2018-11-01,9999-12-31,N,Y,N,N,N,N,Y,1,1,Y,Y]]|
|20200616161512     |20200616161512_0_2586    |ABC:18:4587543:XZ:123456778|2020-06-16            |3a47934b-ffb7-41f5-a9a4-6b4f40519e79-0_0-22-696_20200616161512.parquet |ABC:18:4587543:XZ:123456778|       |1592339194660|2020-06-16|[[2018-11-01,9999-12-31,N,Y,N,N,N,N,Y,1,1,Y,Y]]|
+-------------------+-------------------------+---------------------------+----------------------+-----------------------------------------------------------------------+---------------------------+-------+-------------+----------+-----------------------------------------------+
```
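The query above spot-checks a single key. To quantify the problem across the whole table, the check amounts to grouping (record key, partition path) pairs and flagging any key that appears in more than one partition. A minimal standalone sketch of that check (plain Python over extracted pairs, not a Hudi or Spark API):

```python
from collections import defaultdict

def find_cross_partition_duplicates(rows):
    """Given (record_key, partition_path) pairs, return the keys that
    appear in more than one partition -- exactly the condition a global
    index is supposed to prevent."""
    partitions_by_key = defaultdict(set)
    for record_key, partition_path in rows:
        partitions_by_key[record_key].add(partition_path)
    return {k: sorted(v) for k, v in partitions_by_key.items() if len(v) > 1}

# The duplicate shown above: the same key landed in two daily partitions.
rows = [
    ("ABC:18:4587543:XZ:123456778", "2020-06-15"),
    ("ABC:18:4587543:XZ:123456778", "2020-06-16"),
    ("SOME:OTHER:KEY", "2020-06-16"),  # hypothetical non-duplicated key
]
print(find_cross_partition_duplicates(rows))
# {'ABC:18:4587543:XZ:123456778': ['2020-06-15', '2020-06-16']}
```

In Spark the same check is a `GROUP BY _hoodie_record_key HAVING COUNT(DISTINCT _hoodie_partition_path) > 1` over the table.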
Property file:

```properties
hoodie.datasource.write.recordkey.field=MBR_SYS_ID
hoodie.datasource.write.partitionpath.field=modifiedDt
hoodie.datasource.write.precombine.field=modifiedTs
hoodie.index.type=GLOBAL_BLOOM
hoodie.bloom.index.update.partition.path=true
hoodie.auto.commit=false
enable.auto.commit=false
hoodie.deltastreamer.kafka.source.maxEvents=10000000
#client.id=kaas.prod.elr.edzprod.mcm.hudi.new.cow
group.id=kaas.prod.elr.edzprod.mcm.hudi.cow.dedup.full.load
bootstrap.servers=xyyyyyyz:443
metadata.broker.list=xyyyyyyz:443
auto.offset.reset=earliest
```
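With `hoodie.index.type=GLOBAL_BLOOM` and `hoodie.bloom.index.update.partition.path=true`, the expected behavior is that an upsert for an existing key arriving with a new partition path moves the record: the old copy is deleted from its original partition and the new copy is written to the incoming partition, so each key lives in exactly one partition. A toy model of that expectation (hypothetical function and table shape, not Hudi code):

```python
def global_upsert(table, record_key, partition_path, row, update_partition_path=True):
    """Toy model of a global-index upsert. `table` maps
    record_key -> (partition_path, row), so a key can exist in at most
    one partition. With update_partition_path=True (as in the property
    file above), an incoming record replaces the old copy wherever it
    was; with False, the update is applied in the original partition."""
    if update_partition_path or record_key not in table:
        table[record_key] = (partition_path, row)
    else:
        old_partition, _ = table[record_key]
        table[record_key] = (old_partition, row)
    return table

table = {}
global_upsert(table, "ABC:18:4587543:XZ:123456778", "2020-06-15", {"modifiedTs": 1592268506762})
global_upsert(table, "ABC:18:4587543:XZ:123456778", "2020-06-16", {"modifiedTs": 1592339194660})
# Only one copy remains, in the newest partition:
print(table["ABC:18:4587543:XZ:123456778"][0])  # 2020-06-16
```

The bug reported here is that the real table instead ends up holding the key in both daily partitions, as the spark-shell output shows.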
Stacktrace

Our job runs every hour, reading from the Kafka topic and persisting into the Hudi dataset at the HDFS location using Deltastreamer. Of a total of 117 million records, about 49K are duplicated on the record key, split across 5 partitions.
Issue Analytics
- State:
- Created 3 years ago
- Comments:14 (9 by maintainers)
Top GitHub Comments

Can you link a PR for this issue? @vinothchandar

https://github.com/apache/hudi/pull/1793