New Kafka metrics binder is really slow
I have an application with a simple Kafka Stream, and I've added the latest implementation of KafkaStreamsMetrics.
The loading time of my /prometheus endpoint is very long and it consumes a lot of CPU.
I tried to troubleshoot this and found the following call to be really slow:

```java
// Double-check if new metrics are registered
checkAndBindMetrics(registry);
```

If I remove the double check, the loading time is fine again.
With `checkAndBindMetrics`:
```
time curl http://kestra:2553/prometheus
real    0m6,209s
user    0m0,015s
sys     0m0,001s
```
Without `checkAndBindMetrics`:
```
time curl http://kestra:2553/prometheus
real    0m0,079s
user    0m0,013s
sys     0m0,013s
```
Since the Kafka metrics include a lot of gauges with a lot of tags (metrics × topic × partition × thread; 300 for my application), I think we must not do this check, or should use a cache if necessary (I didn't see any impact from removing it). The check fetches all the metrics inside a synchronized method and is invoked for every single metric, so it is really slow.
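To make the pattern concrete, here is a minimal sketch of what I mean (hypothetical names and a plain map instead of the real Micrometer/Kafka types; this is not the actual `KafkaMetrics` source):

```java
import java.util.Map;
import java.util.function.Supplier;

class KafkaMetricsSketch {
    private final Supplier<Map<String, Number>> metricsSupplier;

    KafkaMetricsSketch(Supplier<Map<String, Number>> metricsSupplier) {
        this.metricsSupplier = metricsSupplier;
    }

    // Called once per gauge read, i.e. once per registered metric
    // on every hit to the /prometheus endpoint.
    double gaugeValue(String metricKey) {
        checkAndBindMetrics();
        return metricsSupplier.get().get(metricKey).doubleValue();
    }

    // Scans the FULL metric map under a lock on every invocation, so one
    // scrape costs roughly (number of gauges) x (number of metrics)
    // iterations in total.
    private synchronized void checkAndBindMetrics() {
        for (Map.Entry<String, Number> entry : metricsSupplier.get().entrySet()) {
            // compare entry against already-registered meters,
            // register any metric that appeared since the last check
        }
    }
}
```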
Top GitHub Comments
Hello. It's a simple stream application with this topology (no other process involved except the stream).
So: 5 topics with 24 partitions.
They are always the same topics; nothing changes there at all. The performance penalty is on every call, not only the first one.
In my app there are 258 gauges generated (× number of tags) = 8910 generated metrics from this Kafka stream.
After analysis: `checkAndBindMetrics(registry)` is called on every collection of every gauge, on each hit to the Prometheus page (so 8910 calls in my case). Since `checkAndBindMetrics` collects all metrics and does a forEach over all of them in a synchronized way, that means iterating over millions of entries by the end of a single Prometheus call.
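One possible mitigation, sketched below with hypothetical names (this is not an existing Micrometer API): throttle the re-check so that at most one gauge read per refresh interval pays for the full scan, instead of every one of the 8910 reads per scrape.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

class ThrottledCheck {
    // Assumed refresh interval; new Kafka metrics would be picked up
    // within this delay rather than instantly.
    private final Duration refreshInterval = Duration.ofSeconds(30);
    private final AtomicLong lastCheckNanos = new AtomicLong(0);

    void maybeCheckAndBindMetrics(Runnable checkAndBindMetrics) {
        long now = System.nanoTime();
        long last = lastCheckNanos.get();
        // Skip the expensive scan unless the interval has elapsed; the CAS
        // ensures only one caller per interval actually runs the scan.
        if (now - last >= refreshInterval.toNanos()
                && lastCheckNanos.compareAndSet(last, now)) {
            checkAndBindMetrics.run();
        }
    }
}
```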
Forget the concurrency concern: I discovered that no concurrent call is possible anyway:
https://stackoverflow.com/a/35498230/1590168
So we can remove the synchronized without risk.
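And if removing it outright feels risky, a non-blocking guard (again just a sketch with hypothetical names, assuming a skipped check is acceptable since the next gauge read will retry) would keep the method safe under unexpected concurrency without ever making gauge reads wait:

```java
import java.util.concurrent.locks.ReentrantLock;

class NonBlockingCheck {
    private final ReentrantLock lock = new ReentrantLock();

    void checkAndBindMetrics(Runnable scanAndRegister) {
        // Never block: if another thread is already scanning, just skip.
        if (lock.tryLock()) {
            try {
                scanAndRegister.run();
            } finally {
                lock.unlock();
            }
        }
    }
}
```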