
New kafka metrics is really slow


I have an application with a simple Kafka Stream, and I’ve added the latest implementation of KafkaStreamsMetrics.

The loading time of my /prometheus endpoint is very long and consumes a lot of CPU. I tried to troubleshoot this and found this code to be really slow:

https://github.com/micrometer-metrics/micrometer/blob/d9655b1f89944a7d14460168b68fcd2bb05b6ddf/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L154-L160

If I remove the double check, the loading time is fine again:

//Double-check if new metrics are registered
checkAndBindMetrics(registry);

With checkAndBindMetrics:

time curl http://kestra:2553/prometheus
real	0m6,209s
user	0m0,015s
sys	0m0,001s

Without checkAndBindMetrics:

time curl http://kestra:2553/prometheus
real	0m0,079s
user	0m0,013s
sys	0m0,013s

Since the Kafka metrics include a lot of gauges with a lot of tags (metrics x topic x partition x thread, ~300 for my application), I think we must not do this check, or use a cache if necessary (I didn’t see any impact from removing the check). The check fetches all the metrics inside a synchronized method, and since it runs for every metric it is really slow.
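As a minimal sketch of the “use a cache if necessary” idea, assuming nothing about Micrometer’s internals: the names MetricsRefresher, refreshIfStale and rebindAll below are hypothetical, and rebindAll stands in for the expensive checkAndBindMetrics(registry) call. The point is that only one gauge read per interval pays for the full rescan, instead of every read:

import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical throttle around the expensive "double-check" rescan.
// Each gauge's value function would call refreshIfStale() instead of
// unconditionally invoking the synchronized rescan.
public class MetricsRefresher {
    private final long minIntervalNanos;
    private final Runnable rebindAll; // stands in for checkAndBindMetrics(registry)
    private final AtomicLong lastRefreshNanos;

    public MetricsRefresher(Duration minInterval, Runnable rebindAll) {
        this.minIntervalNanos = minInterval.toNanos();
        this.rebindAll = rebindAll;
        // Start "stale" so the very first read still performs one full rescan.
        this.lastRefreshNanos = new AtomicLong(System.nanoTime() - minIntervalNanos);
    }

    public void refreshIfStale() {
        long now = System.nanoTime();
        long last = lastRefreshNanos.get();
        // Cheap fast path: most gauge reads return without touching the registry.
        if (now - last >= minIntervalNanos && lastRefreshNanos.compareAndSet(last, now)) {
            rebindAll.run(); // at most one caller per interval pays for the full scan
        }
    }
}

With, say, a 60-second interval, a scrape that reads thousands of gauges would trigger at most one rescan instead of one per gauge.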

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

2 reactions
tchiotludo commented, Mar 27, 2020

Hello. It’s a simple stream application with this topology (no other processing involved except the stream).

Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [RAW-ESB-RECEIPTSYNC-V1])
      --> KSTREAM-PEEK-0000000001
    Processor: KSTREAM-PEEK-0000000001 (stores: [])
      --> KSTREAM-MAPVALUES-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-MAPVALUES-0000000002 (stores: [])
      --> KSTREAM-MAPVALUES-0000000003
      <-- KSTREAM-PEEK-0000000001
    Processor: KSTREAM-MAPVALUES-0000000003 (stores: [])
      --> KSTREAM-FILTER-0000000004
      <-- KSTREAM-MAPVALUES-0000000002
    Processor: KSTREAM-FILTER-0000000004 (stores: [])
      --> KSTREAM-TRANSFORM-0000000005
      <-- KSTREAM-MAPVALUES-0000000003
    Processor: KSTREAM-TRANSFORM-0000000005 (stores: [])
      --> KSTREAM-PEEK-0000000006
      <-- KSTREAM-FILTER-0000000004
    Processor: KSTREAM-PEEK-0000000006 (stores: [])
      --> KSTREAM-MAPVALUES-0000000010, KSTREAM-SINK-0000000007
      <-- KSTREAM-TRANSFORM-0000000005
    Processor: KSTREAM-MAPVALUES-0000000010 (stores: [])
      --> KSTREAM-TRANSFORMVALUES-0000000011
      <-- KSTREAM-PEEK-0000000006
    Processor: KSTREAM-TRANSFORMVALUES-0000000011 (stores: [])
      --> KSTREAM-SINK-0000000012
      <-- KSTREAM-MAPVALUES-0000000010
    Sink: KSTREAM-SINK-0000000007 (topic: RAW-SALES-TICKET-V1)
      <-- KSTREAM-PEEK-0000000006
    Sink: KSTREAM-SINK-0000000012 (topic: EDM-SALES-CUSTOMERRECEIPT-V1)
      <-- KSTREAM-TRANSFORMVALUES-0000000011

  Sub-topology: 1 for global store (will not generate tasks)
    Source: KSTREAM-SOURCE-0000000008 (topics: [EDM-OFFERS-PRODUCT-V1])
      --> KTABLE-SOURCE-0000000009
    Processor: KTABLE-SOURCE-0000000009 (stores: [product_join])
      --> none
      <-- KSTREAM-SOURCE-0000000008

So: 5 topics with 24 partitions.

It’s always the same topics, nothing about them ever changes. The performance penalty is on every call, not only on the first call.

In my app I have 258 gauges generated (x number of tags) = 8910 generated metrics from this Kafka stream.

After analysis, checkAndBindMetrics(registry); is called on every collection of every gauge, on every hit to the Prometheus page (so 8910 calls in my case).

Since checkAndBindMetrics will collect all metrics and iterate over all of them in a synchronized way, that means iterating over millions of entries by the end of the Prometheus call.
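To put a rough number on that (a back-of-envelope sketch using the figures from this issue, and assuming each rescan touches on the order of the same ~8910 Kafka metrics):

public class ScrapeCost {
    public static void main(String[] args) {
        int gauges = 8_910;           // gauges generated from this Kafka Streams app
        int metricsPerRescan = 8_910; // assumed: checkAndBindMetrics walks every Kafka metric
        long iterationsPerScrape = (long) gauges * metricsPerRescan;
        // 79388100, i.e. ~79 million iterations for a single /prometheus call
        System.out.println(iterationsPerScrape);
    }
}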

1 reaction
tchiotludo commented, Apr 6, 2020

I forgot about the concurrency part; I discovered that no concurrent call is possible:

https://stackoverflow.com/a/35498230/1590168

This does not mean that there will not be concurrency over all scheduled tasks. Rather, for each task (created by invocation of scheduleAtFixedRate), the Runnable only executes on one thread at a time - even if the execution time overruns the interval.

So we can remove the synchronized without risk.
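A small self-contained sketch of the guarantee that Stack Overflow answer describes (plain java.util.concurrent, nothing Micrometer-specific): even with a multi-threaded scheduler pool and a period shorter than the task’s runtime, scheduleAtFixedRate never runs the same task concurrently with itself, so a synchronized guard on that path only protects against an overlap that cannot happen:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class FixedRateNoOverlap {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
        AtomicBoolean running = new AtomicBoolean(false);

        // The task deliberately overruns its 10 ms period; the next run is simply
        // delayed, never started while the previous one is still executing.
        scheduler.scheduleAtFixedRate(() -> {
            if (!running.compareAndSet(false, true)) {
                System.out.println("overlap detected"); // never printed
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
            } finally {
                running.set(false);
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(1_000);
        scheduler.shutdownNow();
    }
}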

