Producer high CPU utilization

Description

We have some producers in production that are causing high CPU usage (60% of the service's CPU usage). Our current throughput is about 200k messages (300 MB) per second. [image]

  • Confluent kafka nuget version

Confluent.Kafka Version="1.0.1.1", librdkafka.redist Version="1.0.1"

  • Client configuration

var bootstrapServers = "x";
var clientConfig = new ClientConfig
{
  BootstrapServers = bootstrapServers
};
var producerConfig = new ProducerConfig(clientConfig);

  • Operating system

We are running on a Windows machine (Windows Server 2016 Datacenter) with .NET Core 3.0.

How to reproduce

We are using one of the queue abstractions from our library: https://github.com/takenet/elephant/blob/master/src/Take.Elephant.Kafka/KafkaSenderQueue.cs

To produce messages, we call its EnqueueAsync method.

Checklist

Please provide the following information:

  • A complete (i.e. we can run it), minimal program demonstrating the problem. No need to supply a project file.
  • Confluent.Kafka nuget version.
  • Apache Kafka version.
  • Client configuration.
  • Operating system.
  • Provide logs (with “debug” : “…” as necessary in configuration).
  • Provide broker log excerpts.
  • Critical issue.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 14 (5 by maintainers)

Top GitHub Comments

2 reactions
mhowlett commented, Feb 18, 2020

poll blocks until librdkafka is ready to inform the application of a new event. It doesn't busy-wait, so it won't result in high CPU usage if there are no events. You don't need to call poll for every produce call, just periodically. When it's called, callbacks are executed for every currently outstanding event (if a corresponding callback is registered; otherwise the event is dropped). Events include delivery callback notifications and error or log events. This all happens behind the scenes in the .NET library.
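
As a rough illustration of the point above, here is a minimal sketch of producing with a delivery handler in Confluent.Kafka and never calling poll from application code. The broker address, the topic name "example-topic", and the class name ProducerSketch are assumptions made for illustration; they do not come from the issue.

```csharp
using System;
using Confluent.Kafka;

public static class ProducerSketch
{
    public static void Main()
    {
        // Assumption: a broker is reachable at this address.
        var config = new ProducerConfig { BootstrapServers = "localhost:9092" };

        using (var producer = new ProducerBuilder<Null, string>(config).Build())
        {
            for (var i = 0; i < 1000; i++)
            {
                // Produce only enqueues the message locally; it does not wait for the broker.
                producer.Produce(
                    "example-topic", // hypothetical topic name
                    new Message<Null, string> { Value = $"message {i}" },
                    report =>
                    {
                        // Called later by the client, once the delivery result is known.
                        if (report.Error.IsError)
                        {
                            Console.WriteLine($"Delivery failed: {report.Error.Reason}");
                        }
                    });
            }

            // Wait up to 10 seconds for outstanding delivery reports before disposing.
            producer.Flush(TimeSpan.FromSeconds(10));
        }
    }
}
```

The application never polls here: the delivery handler is driven by the client's own polling, and Flush only waits for outstanding reports at shutdown.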

1 reaction
j2jensen commented, May 2, 2023

I have an interesting update on this.

I identified an issue in the application that accounted for the second-highest hotspot on the hotspot image above (TimerQueueTimer.Fire: 8.9%). Here’s what happened after releasing the fix:

image

The entire issue I reported got resolved by addressing something which Dynatrace thought was a tiny part of the overall problem.

Now the same service is showing 95% of its time in the same StartPollTask hotspot. Looking at other services that use Kafka, I'm seeing similarly high percentages as well. But if you do the math, it doesn't add up. Before the fix, Kafka accounted for 75% of 20% total CPU usage, so supposedly 15% of total CPU. Now it accounts for 95% of 5% total usage, so ~4.8%? And the only change made had nothing to do with Kafka. That doesn't make sense.

My hypothesis: Dynatrace does its hotspot evaluation not by knowing what's actually using CPU time, but by evaluating stack traces and seeing where the code appears to be sitting. Since Kafka has a single, persistent background thread that's almost always sitting (effectively blocked) on this line of code at the boundary between .NET and the unmanaged librdkafka library, it gives Dynatrace the impression that this line of code is a hotspot, when in reality it accounts for hardly any of the actual CPU usage.
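
The arithmetic in the comment above can be written out explicitly. The percentages are the ones quoted in the comment; the symbols are introduced here only for illustration. If the hotspot shares reflected real CPU time, the implied Kafka share of total CPU would be:

```latex
\begin{align*}
c_{\text{before}} &= 0.75 \times 20\% = 15\% \\
c_{\text{after}}  &= 0.95 \times 5\% = 4.75\% \approx 4.8\% \\
c_{\text{before}} - c_{\text{after}} &= 10.25 \text{ percentage points}
\end{align*}
```

A drop of roughly 10 points in Kafka's apparent CPU share after a change that did not touch Kafka is what makes a literal reading of the hotspot numbers implausible.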


