
[SUPPORT] Slow upsert time reading from Kafka

Hi all,

We're experiencing strange issues using DeltaStreamer with Hudi 0.5.2. We're reading from a Kafka source, specifically a compacted topic with 50 partitions. We partition via a custom KeyResolver that basically partitions the same way Kafka does: murmur3hash(recordKey) mod number_of_partitions.

During the first three runs everything goes smoothly (each run ingests 5 million records). At the fourth run, the process suddenly slows down a lot. Looking at the job stages, we saw that countByKey is the step that takes too long, with low cluster usage/load (is it shuffling?).
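For illustration, a key resolver of the kind described above (murmur3 hash of the record key, modulo the number of partitions) might look roughly like the sketch below. The class and method names are hypothetical, not the actual code from this setup, and it leans on Guava's murmur3 implementation:

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: hash the record key with murmur3 and take the
// result modulo the number of partitions, mirroring Kafka-style partitioning.
public class Murmur3KeyResolver {

    private final int numPartitions;

    public Murmur3KeyResolver(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Returns a partition index in [0, numPartitions) for the given record key. */
    public int partitionFor(String recordKey) {
        int hash = Hashing.murmur3_32()
                .hashString(recordKey, StandardCharsets.UTF_8)
                .asInt();
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, numPartitions);
    }
}
```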

Here are the Hudi properties we're using:

# Hoodie properties
hoodie.upsert.shuffle.parallelism=5
hoodie.insert.shuffle.parallelism=5
hoodie.bulkinsert.shuffle.parallelism=5
hoodie.embed.timeline.server=true
hoodie.filesystem.view.type=EMBEDDED_KV_STORE
hoodie.compact.inline=false
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.clean.automatic=true
hoodie.combine.before.upsert=true
hoodie.cleaner.fileversions.retained=1
hoodie.bloom.index.prune.by.ranges=false
hoodie.index.bloom.num_entries=1000000
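On the "is it shuffling?" question above: in Spark 2.x, countByKey is implemented as a mapValues/reduceByKey followed by a collect of the per-key counts to the driver, so yes, it does involve a shuffle. Below is a minimal, self-contained sketch in plain Spark (local mode, made-up data and partition count; this is not Hudi's internal code):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.Map;

// Standalone demo: countByKey on a pair RDD shuffles the data by key
// (reduceByKey under the hood) before collecting the counts to the driver.
public class CountByKeyShuffleDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("countByKey-demo")
                .setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
                    Arrays.asList(
                            new Tuple2<>("a", 1),
                            new Tuple2<>("b", 1),
                            new Tuple2<>("a", 1)),
                    2); // number of input partitions; made-up value
            // The resulting stage shows up as a shuffle in the Spark UI.
            Map<String, Long> counts = pairs.countByKey();
            System.out.println(counts); // {a=2, b=1}
        }
    }
}
```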

Last run (the one that is taking too long): [two Spark UI screenshots, 2020-05-07 15:32]

First, second and third run (which went very well): [screenshots of the first, second and third runs]

Thank you in advance!

PS: Our record keys are version 4 UUIDs.
PPS: We're running DeltaStreamer on an EMR 5.28.0 cluster, writing Hudi files to S3.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Top GitHub Comments

1 reaction
reste85 commented, May 18, 2020

Hi guys,

As stated in chat, this is not related to Hudi at all. At first sight we thought the problem was https://issues.apache.org/jira/browse/KAFKA-4753. The problem can probably be explained by the fact that we're using transactional producers: with transactional producers, some Kafka offsets take on a different meaning and are used for transaction bookkeeping rather than actual messages. That offset usage is most likely what makes our consumer hang in the "last run", because the reported ending offset is never reachable. For more details, see:
https://stackoverflow.com/questions/59763422/in-my-kafka-topic-end-of-the-offset-is-higher-than-last-messagess-offset-numbe
https://issues.apache.org/jira/browse/KAFKA-8358
https://stackoverflow.com/questions/56182606/in-kafka-when-producing-message-with-transactional-consumer-offset-doubled-up

I’m closing the issue, thank you for your support!
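For reference, the offset gap described in the comment above can be made visible with a plain KafkaConsumer. A minimal sketch under stated assumptions follows (broker address, topic name, and group id are placeholders, not values from this setup): with transactional producers, transaction markers occupy offsets, so the broker-reported end offset can sit past the offset of the last real record, and a consumer that waits to "reach" the end offset appears to hang.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

// Illustrative check: read one partition to the end and compare the offset of
// the last record actually returned with the end offset reported by the broker.
// With transactional producers the two can differ, because transaction markers
// (and aborted records under read_committed) also occupy offsets.
public class EndOffsetGapCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-gap-check");        // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Only read committed data, as a consumer of a transactional pipeline normally would.
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        TopicPartition tp = new TopicPartition("my-compacted-topic", 0);      // placeholder
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));

            long endOffset = consumer.endOffsets(Collections.singletonList(tp)).get(tp);

            long lastRecordOffset = -1L;
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));
                if (records.isEmpty()) {
                    break; // nothing left to read, even if endOffset was never "reached"
                }
                for (ConsumerRecord<String, String> record : records) {
                    lastRecordOffset = record.offset();
                }
            }
            System.out.printf("end offset = %d, last record offset = %d, gap = %d%n",
                    endOffset, lastRecordOffset, (endOffset - 1) - lastRecordOffset);
        }
    }
}
```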

1 reaction
lamberken commented, May 12, 2020

This seems like a performance issue. Thanks for reporting it, we will delve into it. :)

