
Deltastreamer - Global bloom index resulting in duplicates across partitions for the same record key

See original GitHub issue

Tips before filing an issue

  • Have you gone through our FAQs? Yes

  • Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

We are using Deltastreamer to read from a Kafka topic and persist the data to an HDFS location. We enabled the GLOBAL_BLOOM index on the COPY_ON_WRITE storage type with UPSERT mode, since our requirement is to have no duplicates across partitions, but we are seeing duplicates for the same record key across partitions.

We tried two of the latest hudi-utilities bundle jars listed below, with the same result:

  • hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar
  • hudi-utilities-bundle_2.11-0.5.2-SNAPSHOT.jar

To Reproduce

Steps to reproduce the behavior:

Expected behavior

No duplicates across partitions for the same record key.

Environment Description

  • Hudi version : 0.5.2

  • Spark version : 2.2.1

  • Hive version :

  • Hadoop version : 2.7

  • Storage (HDFS/S3/GCS…) : HDFS

  • Running on Docker? (yes/no) : No

Additional context

Sample duplicate record (same record key under two partition paths):

scala> spark.sql("SELECT * FROM hudiTab where MBR_SYS_ID ='ABC:18:4587543:XZ:123456778'").show(5,false)

Row 1:
  _hoodie_commit_time    : 20200615211346
  _hoodie_commit_seqno   : 20200615211346_24_1166263
  _hoodie_record_key     : ABC:18:4587543:XZ:123456778
  _hoodie_partition_path : 2020-06-15
  _hoodie_file_name      : 790b3252-c4f7-46ed-ad86-98877541b89b-0_24-22-717_20200615211346.parquet
  MBR_SYS_ID             : ABC:18:4587543:XZ:123456778
  INDV_ID                : (empty)
  modifiedTs             : 1592268506762
  modifiedDt             : 2020-06-15
  indicators             : [[2018-11-01,9999-12-31,N,Y,N,N,N,N,Y,1,1,Y,Y]]

Row 2:
  _hoodie_commit_time    : 20200616161512
  _hoodie_commit_seqno   : 20200616161512_0_2586
  _hoodie_record_key     : ABC:18:4587543:XZ:123456778
  _hoodie_partition_path : 2020-06-16
  _hoodie_file_name      : 3a47934b-ffb7-41f5-a9a4-6b4f40519e79-0_0-22-696_20200616161512.parquet
  MBR_SYS_ID             : ABC:18:4587543:XZ:123456778
  INDV_ID                : (empty)
  modifiedTs             : 1592339194660
  modifiedDt             : 2020-06-16
  indicators             : [[2018-11-01,9999-12-31,N,Y,N,N,N,N,Y,1,1,Y,Y]]
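For anyone hitting the same symptom, a quick way to see how widespread the problem is would be to group on the Hudi metadata columns and list record keys that appear under more than one partition path. A minimal spark-shell sketch, assuming the table is registered as hudiTab as in the query above (the query itself is illustrative and not from the original issue):

scala> // record keys present under more than one partition path (should return no rows with a working global index)
scala> spark.sql("SELECT _hoodie_record_key, COUNT(DISTINCT _hoodie_partition_path) AS num_partitions, COUNT(*) AS num_rows FROM hudiTab GROUP BY _hoodie_record_key HAVING COUNT(DISTINCT _hoodie_partition_path) > 1").show(20, false)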

Property file:

hoodie.datasource.write.recordkey.field=MBR_SYS_ID
hoodie.datasource.write.partitionpath.field=modifiedDt
hoodie.datasource.write.precombine.field=modifiedTs
hoodie.index.type=GLOBAL_BLOOM
hoodie.bloom.index.update.partition.path=true
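# Clarifying note (not part of the original property file): with GLOBAL_BLOOM and
# hoodie.bloom.index.update.partition.path=true, an update arriving with a new
# partition path is expected to be removed from the old partition and written to
# the new one, so the same record key should not end up in two partitions.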
hoodie.auto.commit=false
enable.auto.commit=false
hoodie.deltastreamer.kafka.source.maxEvents=10000000

#client.id=kaas.prod.elr.edzprod.mcm.hudi.new.cow
group.id=kaas.prod.elr.edzprod.mcm.hudi.cow.dedup.full.load
bootstrap.servers=xyyyyyyz:443
metadata.broker.list=xyyyyyyz:443
auto.offset.reset=earliest
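The ingestion in this issue is driven by Deltastreamer with the property file above, but the same Hudi keys can also be passed as Spark datasource write options, which can be handy when reproducing the table layout from a spark-shell. A minimal sketch under stated assumptions: df is an already-prepared input batch and /tmp/hudi/hudiTab is a hypothetical base path, neither taken from the original issue.

import org.apache.spark.sql.DataFrame

// Upsert one incoming batch using the same keys as the property file above.
// "df", the table name and the base path are assumptions made for this sketch.
def upsertBatch(df: DataFrame): Unit = {
  df.write
    .format("org.apache.hudi")
    .option("hoodie.table.name", "hudiTab")                              // hypothetical table name
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "MBR_SYS_ID")
    .option("hoodie.datasource.write.partitionpath.field", "modifiedDt")
    .option("hoodie.datasource.write.precombine.field", "modifiedTs")
    .option("hoodie.index.type", "GLOBAL_BLOOM")
    .option("hoodie.bloom.index.update.partition.path", "true")
    .mode("append")                                                      // keep appending commits to the existing table
    .save("/tmp/hudi/hudiTab")                                           // hypothetical HDFS base path
}

This only illustrates how the configuration keys fit together; the actual job in this issue runs through the hudi-utilities DeltaStreamer with the property file shown above.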

Stacktrace

Our job runs every hour, reading from a Kafka topic and persisting into a Hudi dataset at an HDFS location using Deltastreamer. The dataset has about 117 million records in total, with 49K duplicate record keys split across 5 partitions.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 14 (9 by maintainers)

Top GitHub Comments

mingujotemp commented, Aug 4, 2020 (0 reactions)

Can you link a PR for this issue? @vinothchandar

