High CPU usage in KafkaConsumer.poll() when subscribed to many topics with no new messages (possibly SSL related)
Experiencing high CPU usage when sitting idle in poll() (i.e., waiting for a timeout when there are no new messages on the broker). It gets worse the more topics I am subscribed to (CPU is pegged at 100% with 40 topics). Note that I am using 1.3.4 with mostly default configs, and I reproduced this on current master as well.
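For reference, here is a minimal sketch of the setup that reproduces it for me (broker address, SSL settings, and topic names are placeholders, not my real config):
from kafka import KafkaConsumer

# Placeholder broker/topic names; the real setup uses SSL and ~40 topics.
consumer = KafkaConsumer(
    bootstrap_servers='broker:9093',
    security_protocol='SSL',
    group_id='my-group',
)
consumer.subscribe(['topic-{}'.format(i) for i in range(40)])

while True:
    # Blocks here (and burns CPU) when the broker has nothing new.
    records = consumer.poll(timeout_ms=5000)
    for tp, msgs in records.items():
        for msg in msgs:
            print(msg.value)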
There seem to be a couple of things at play here. One is that poll() issues fetch requests in a tight loop. The other, which is the one that really seems to be killing CPU, is that when a fetch response is received, the low-level poll() gets into a relatively tight loop as the payload buffer fills, adding a relatively small number of bytes at a time. This explains the effect of adding more topics: the fetch responses are bigger, so it spends more time in this tight loop. Here’s some debug output based on a couple of probes I put in the code:
In conn.py: _recv()
if staged_bytes != self._next_payload_bytes:
    print("staged: {} payload: {}".format(staged_bytes, self._next_payload_bytes))
    return None
In consumer/group.py: _poll_once()
print("fetch!")
# Send any new fetches (won't resend pending fetches)
self._fetcher.send_fetches()
So, for one topic I get output like this while blocked in poll():
fetch!
staged: 4 payload: 104
fetch!
staged: 12 payload: 104
fetch!
staged: 50 payload: 104
fetch!
staged: 68 payload: 104
fetch!
staged: 86 payload: 104
fetch!
fetch!
staged: 4 payload: 104
For 2 topics:
fetch!
staged: 4 payload: 179
fetch!
staged: 12 payload: 179
fetch!
staged: 51 payload: 179
fetch!
staged: 69 payload: 179
fetch!
staged: 87 payload: 179
fetch!
staged: 105 payload: 179
fetch!
staged: 143 payload: 179
fetch!
staged: 161 payload: 179
fetch!
fetch!
staged: 4 payload: 197
fetch!
For 40 topics:
fetch!
staged: 2867 payload: 3835
fetch!
staged: 2885 payload: 3835
fetch!
staged: 2939 payload: 3835
fetch!
staged: 2957 payload: 3835
fetch!
staged: 2975 payload: 3835
staged: 4 payload: 3799
fetch!
staged: 12 payload: 3799
fetch!
staged: 58 payload: 3799
fetch!
staged: 76 payload: 3799
fetch!
staged: 94 payload: 3799
fetch!
staged: 112 payload: 3799
fetch!
staged: 154 payload: 3799
fetch!
... and many, many more
So it gets stuck spinning in this loop, and CPU goes to 100%.
I tried mitigating this using consumer fetch config:
fetch_min_bytes=1000000,
fetch_max_wait_ms=2000,
but that did nothing.
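(For completeness, those were passed as consumer config, roughly like this; just a sketch with a placeholder broker address:)
from kafka import KafkaConsumer

# Same consumer setup as above, with the fetch tuning applied.
consumer = KafkaConsumer(
    bootstrap_servers='broker:9093',
    fetch_min_bytes=1000000,   # ask the broker to accumulate ~1MB...
    fetch_max_wait_ms=2000,    # ...or wait up to 2s before responding
)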
The only thing that gets the CPU down is to do a non-blocking poll() instead of using a timeout, and then do a short sleep when there are no result records (my application can tolerate that latency). It looks like poll() used to support something like this, i.e., there was a sleep parameter that caused it to sleep for the remainder of the timeout period if there were no records on the first fetch. Looks like that was removed in 237bd73, not sure why.
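Roughly, the workaround looks like this (a sketch of the non-blocking poll plus my own sleep):
import time

while True:
    # timeout_ms=0 makes poll() return immediately with whatever is ready.
    records = consumer.poll(timeout_ms=0)
    if not records:
        time.sleep(0.5)  # back off instead of spinning inside poll()
        continue
    for tp, msgs in records.items():
        for msg in msgs:
            print(msg.value)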
So… like I said, I can work around the continuous fetching with my own sleep. It would be good to understand the real problem, which is the tight _recv() loop, and whether anything can be done about it.
OK, I think I know what’s going on now. My test server was sending 16K in a single write. When I make it respond with multiple small writes of 50 bytes each, the client recv() returns 50 bytes at a time. What is probably happening here (and I am speculating a bit, as I am not intimately familiar with the SSL protocol) is that each write from the server is encoded as a separate SSL record, and on the client side, recv() on an SSL socket returns one record at a time.
It looks like the Kafka broker is sending the FetchResponse as a bunch of small writes. In fact, it looks like it is doing a write for each topic name and a write for each (empty) message set. That explains the pattern of 18-byte reads (the message sets) and somewhat larger reads (the topics). The reason I see 30-byte reads when using the master code is that it uses the newer protocol; if I specify api_version=(0,10), it goes back to 18 bytes.
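If I’m reading the v0 FetchResponse layout right (an assumption on my part, based on the protocol docs), the 18-byte chunks match the per-partition block for an empty partition:
# FetchResponse v0, per-partition block for an empty partition (my reading
# of the protocol docs, so treat this as a sanity check):
partition_id     = 4  # int32
error_code       = 2  # int16
highwater_offset = 8  # int64
message_set_size = 4  # int32, value 0 for an empty message set
print(partition_id + error_code + highwater_offset + message_set_size)  # 18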
So, I’m not sure there is much that can be done about getting the small chunks of data from recv(). Do you think it might be possible to safely assemble the response in a tighter loop, without going back through a select() call every time, in order to make it more efficient?
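Something like the following is what I had in mind (not the actual kafka-python code, just a sketch): once select() reports the socket readable, keep draining it until the SSL layer has nothing more buffered, instead of going back to select() after every small record.
import ssl

def drain(sock, buf):
    """Read from a non-blocking socket until it would block, so a single
    select() wakeup consumes all the small SSL records the broker's writes
    produced."""
    while True:
        try:
            chunk = sock.recv(4096)
        except ssl.SSLWantReadError:
            break  # no complete SSL record buffered; go back to select()
        except (BlockingIOError, InterruptedError):
            break  # plain (non-SSL) socket would block
        if not chunk:
            raise ConnectionError('socket closed during read')
        buf.extend(chunk)
    return buf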
Thanks, I probably should have opened a separate ticket myself. And thanks for getting a fix in for the CPU issue!