
New kafka metrics is really slow


I have an application with a simple Kafka Stream, and I’ve added the latest implementation of KafkaStreamsMetrics.

The loading time of my /prometheus endpoint is very long and consumes a lot of CPU. I tried to troubleshoot this and found this code to be really slow:

https://github.com/micrometer-metrics/micrometer/blob/d9655b1f89944a7d14460168b68fcd2bb05b6ddf/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L154-L160

If I remove the double check, the loading time is fine again:

//Double-check if new metrics are registered
checkAndBindMetrics(registry);

With checkAndBindMetrics:

time curl http://kestra:2553/prometheus
real	0m6,209s
user	0m0,015s
sys	0m0,001s

Without checkAndBindMetrics:

time curl http://kestra:2553/prometheus
real	0m0,079s
user	0m0,013s
sys	0m0,013s

Since the Kafka metrics include a lot of gauges with a lot of tags (metrics x topic x partition x thread, ~300 for my application), I think we must not do this check, or use a cache if necessary (I didn’t see any impact from removing the check). The check fetches all the metrics inside a synchronized method, and since it runs for every metric it is really slow.
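As a minimal sketch of the “use a cache if necessary” idea, assuming nothing about Micrometer’s internals: the names MetricsRefresher, refreshIfStale and rebindAll below are hypothetical, and rebindAll stands in for the expensive checkAndBindMetrics(registry) call. The point is that only one gauge read per interval pays for the full rescan, instead of every read:

import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical throttle around the expensive "double-check" rescan.
// Each gauge's value function would call refreshIfStale() instead of
// unconditionally invoking the synchronized rescan.
public class MetricsRefresher {
    private final long minIntervalNanos;
    private final Runnable rebindAll; // stands in for checkAndBindMetrics(registry)
    private final AtomicLong lastRefreshNanos;

    public MetricsRefresher(Duration minInterval, Runnable rebindAll) {
        this.minIntervalNanos = minInterval.toNanos();
        this.rebindAll = rebindAll;
        // Start "stale" so the very first read still performs one full rescan.
        this.lastRefreshNanos = new AtomicLong(System.nanoTime() - minIntervalNanos);
    }

    public void refreshIfStale() {
        long now = System.nanoTime();
        long last = lastRefreshNanos.get();
        // Cheap fast path: most gauge reads return without touching the registry.
        if (now - last >= minIntervalNanos && lastRefreshNanos.compareAndSet(last, now)) {
            rebindAll.run(); // at most one caller per interval pays for the full scan
        }
    }
}

With, say, a 60-second interval, a scrape that reads thousands of gauges would trigger at most one rescan instead of one per gauge.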

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

2 reactions
tchiotludo commented, Mar 27, 2020

Hello. It’s a simple stream application with this topology (no other processing involved except the stream).

Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [RAW-ESB-RECEIPTSYNC-V1])
      --> KSTREAM-PEEK-0000000001
    Processor: KSTREAM-PEEK-0000000001 (stores: [])
      --> KSTREAM-MAPVALUES-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-MAPVALUES-0000000002 (stores: [])
      --> KSTREAM-MAPVALUES-0000000003
      <-- KSTREAM-PEEK-0000000001
    Processor: KSTREAM-MAPVALUES-0000000003 (stores: [])
      --> KSTREAM-FILTER-0000000004
      <-- KSTREAM-MAPVALUES-0000000002
    Processor: KSTREAM-FILTER-0000000004 (stores: [])
      --> KSTREAM-TRANSFORM-0000000005
      <-- KSTREAM-MAPVALUES-0000000003
    Processor: KSTREAM-TRANSFORM-0000000005 (stores: [])
      --> KSTREAM-PEEK-0000000006
      <-- KSTREAM-FILTER-0000000004
    Processor: KSTREAM-PEEK-0000000006 (stores: [])
      --> KSTREAM-MAPVALUES-0000000010, KSTREAM-SINK-0000000007
      <-- KSTREAM-TRANSFORM-0000000005
    Processor: KSTREAM-MAPVALUES-0000000010 (stores: [])
      --> KSTREAM-TRANSFORMVALUES-0000000011
      <-- KSTREAM-PEEK-0000000006
    Processor: KSTREAM-TRANSFORMVALUES-0000000011 (stores: [])
      --> KSTREAM-SINK-0000000012
      <-- KSTREAM-MAPVALUES-0000000010
    Sink: KSTREAM-SINK-0000000007 (topic: RAW-SALES-TICKET-V1)
      <-- KSTREAM-PEEK-0000000006
    Sink: KSTREAM-SINK-0000000012 (topic: EDM-SALES-CUSTOMERRECEIPT-V1)
      <-- KSTREAM-TRANSFORMVALUES-0000000011

  Sub-topology: 1 for global store (will not generate tasks)
    Source: KSTREAM-SOURCE-0000000008 (topics: [EDM-OFFERS-PRODUCT-V1])
      --> KTABLE-SOURCE-0000000009
    Processor: KTABLE-SOURCE-0000000009 (stores: [product_join])
      --> none
      <-- KSTREAM-SOURCE-0000000008

So: 5 topics with 24 partitions.

It’s always the same topics, nothing about them ever changes. The performance penalty is on every call, not only on the first call.

In my app I have 258 gauges generated (x number of tags) = 8910 generated metrics from this Kafka stream.

After analysis, checkAndBindMetrics(registry); is called on every collection of every gauge, on every hit to the Prometheus page (so 8910 calls in my case).

Since checkAndBindMetrics will collect all metrics and iterate over all of them in a synchronized way, that means iterating over millions of entries by the end of the Prometheus call.
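To put a rough number on that (a back-of-envelope sketch using the figures from this issue, and assuming each rescan touches on the order of the same ~8910 Kafka metrics):

public class ScrapeCost {
    public static void main(String[] args) {
        int gauges = 8_910;           // gauges generated from this Kafka Streams app
        int metricsPerRescan = 8_910; // assumed: checkAndBindMetrics walks every Kafka metric
        long iterationsPerScrape = (long) gauges * metricsPerRescan;
        // 79388100, i.e. ~79 million iterations for a single /prometheus call
        System.out.println(iterationsPerScrape);
    }
}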

1 reaction
tchiotludo commented, Apr 6, 2020

I forgot about the concurrency part; I discovered that no concurrent call is possible:

https://stackoverflow.com/a/35498230/1590168

This does not mean that there will not be concurrency over all scheduled tasks. Rather, for each task (created by invocation of scheduleAtFixedRate), the Runnable only executes on one thread at a time - even if the execution time overruns the interval.

So we can remove the synchronized without risk.
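A small self-contained sketch of the guarantee that Stack Overflow answer describes (plain java.util.concurrent, nothing Micrometer-specific): even with a multi-threaded scheduler pool and a period shorter than the task’s runtime, scheduleAtFixedRate never runs the same task concurrently with itself, so a synchronized guard on that path only protects against an overlap that cannot happen:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class FixedRateNoOverlap {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4);
        AtomicBoolean running = new AtomicBoolean(false);

        // The task deliberately overruns its 10 ms period; the next run is simply
        // delayed, never started while the previous one is still executing.
        scheduler.scheduleAtFixedRate(() -> {
            if (!running.compareAndSet(false, true)) {
                System.out.println("overlap detected"); // never printed
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
            } finally {
                running.set(false);
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(1_000);
        scheduler.shutdownNow();
    }
}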

