[SUPPORT] Slow upsert time reading from Kafka
Hi all, we're experiencing strange issues using DeltaStreamer with Hudi 0.5.2. We're reading from a Kafka source, specifically a compacted topic with 50 partitions. We partition via a custom KeyResolver that basically mirrors Kafka's partitioning (murmur3hash(recordKey) mod number_of_partitions). During the first three runs everything goes smoothly (each run ingests 5 million records). On the fourth run, the process suddenly slows down dramatically. Looking at the job stages, we saw that countByKey is the step taking too long, with low cluster usage/load (is it shuffling?).
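For context, the partitioning scheme described above can be sketched roughly as follows. This is a minimal illustration, not our actual KeyResolver: `murmur3_32` is a pure-Python MurmurHash3 (x86, 32-bit) implementation, and `partition_for` is a hypothetical helper name.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, pure Python."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    length = len(data)
    rounded = length - (length % 4)
    # Process the body in 4-byte little-endian chunks.
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF  # rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF  # rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Process the 1-3 byte tail, if any.
    tail = data[rounded:]
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def partition_for(record_key: str, num_partitions: int) -> int:
    # Mask off the sign bit so the result is non-negative,
    # then take the modulus, as described in the issue.
    return (murmur3_32(record_key.encode("utf-8")) & 0x7FFFFFFF) % num_partitions
```

The point of this scheme is that the same recordKey always lands in the same Hudi partition, mirroring how Kafka assigns keys to topic partitions.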
Here are the Hudi properties we're using:

# Hoodie properties
hoodie.upsert.shuffle.parallelism=5
hoodie.insert.shuffle.parallelism=5
hoodie.bulkinsert.shuffle.parallelism=5
hoodie.embed.timeline.server=true
hoodie.filesystem.view.type=EMBEDDED_KV_STORE
hoodie.compact.inline=false
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.clean.automatic=true
hoodie.combine.before.upsert=true
hoodie.cleaner.fileversions.retained=1
hoodie.bloom.index.prune.by.ranges=false
hoodie.index.bloom.num_entries=1000000
Last run (the one that is taking too long):
First, second, and third runs (which went very well):
Thank you in advance!
PS: Our recordKeys are UUIDs (type 4). PPS: We're running DeltaStreamer on an EMR 5.28.0 cluster, writing Hudi files to S3.
Issue Analytics
- State: closed
- Created: 3 years ago
- Comments: 16 (8 by maintainers)
Top GitHub Comments
Hi guys, as stated in chat, this is not related to Hudi itself. At first sight we thought the problem was due to this: https://issues.apache.org/jira/browse/KAFKA-4753. In fact, the problem comes down to the fact that we're using transactional producers: with transactional producers, offsets in Kafka have a different meaning, and some of them are used internally to handle transactions. This offset usage is probably what makes our consumers hang in the "last run": the ending offset is never reachable by consuming data records alone. For reference, have a look at:
https://stackoverflow.com/questions/59763422/in-my-kafka-topic-end-of-the-offset-is-higher-than-last-messagess-offset-numbe
https://issues.apache.org/jira/browse/KAFKA-8358
https://stackoverflow.com/questions/56182606/in-kafka-when-producing-message-with-transactional-consumer-offset-doubled-up
I’m closing the issue, thank you for your support!
This seems like a performance issue; thanks for reporting it, will delve into this. : )