
RequestTimedOutError when polling with all partitions paused

When I pause all my partitions and call poll, I get a RequestTimedOutError after 40 seconds (presumably due to request.timeout.ms). This is shortly followed by an “Auto offset commit failed for group XYZ...” error. I’m pausing all my partitions so that I can perform a long “batch” operation while continuing to poll the topic for heartbeat purposes, so that Kafka doesn’t think I’m dead, but without actually fetching any messages. Once the batch operation is done I’ll resume the partitions. I’m essentially trying to implement a background heartbeat by interleaving a “no-op” poll into the batch operation at a regular interval (every 2 seconds).

Why am I doing this? I want to load process state from a compacted topic during process startup, and I also want to use Kafka’s automatic partition assignment. I use a consistent hash key for all my topics, so the data in my partitions aligns from topic to topic. When a process starts, it connects to the main input topic and gets its partition assignments. I then load state from the compacted topic by manually assigning partitions, using the same partition numbers that were automatically assigned for the main input topic. I read messages from the compacted topic up to a predetermined offset while calling poll on the main topic at a regular 2-second interval. This seems like it should work. However, after 40 seconds I get RequestTimedOutError, “closing connection”, and similar errors.
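The startup sequence described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's actual code: `load_state`, `apply_record`, and the two consumer arguments are hypothetical names, and the "predetermined offset" is assumed here to be the end offset of each state partition at the moment loading starts. The consumer calls used (`assignment`, `assign`, `seek_to_beginning`, `end_offsets`, `position`, `pause`, `resume`, `poll`) are kafka-python `KafkaConsumer` methods.

```python
from collections import namedtuple

try:
    from kafka import TopicPartition  # kafka-python
except ImportError:
    # Stand-in with the same fields, so the sketch can run without kafka-python.
    TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])


def load_state(main_consumer, state_consumer, state_topic, apply_record):
    """Replay the compacted topic for the partitions assigned on the main topic.

    Hypothetical helper: main_consumer is subscribed to the input topic and has
    already received its assignment; state_consumer is a second, unsubscribed
    consumer used only for manual assignment.
    """
    main_tps = main_consumer.assignment()
    # Same partition numbers, different topic (relies on the shared hash key).
    state_tps = [TopicPartition(state_topic, tp.partition) for tp in main_tps]
    state_consumer.assign(state_tps)
    state_consumer.seek_to_beginning(*state_tps)
    # Assumption: the stop point is the end offset at the time loading starts.
    targets = state_consumer.end_offsets(state_tps)

    main_consumer.pause(*main_tps)
    try:
        while any(state_consumer.position(tp) < targets[tp] for tp in state_tps):
            batch = state_consumer.poll(timeout_ms=500)
            for tp, records in batch.items():
                for record in records:
                    if record.offset < targets[tp]:
                        apply_record(record)
            # "No-op" poll: all main partitions are paused, so no records are
            # expected; the call only gives the consumer a chance to do its
            # group housekeeping.
            main_consumer.poll(timeout_ms=100)
    finally:
        main_consumer.resume(*main_tps)
```

As the comments further down the thread explain, a loop like this behaved as expected only when it ran from the main poll loop rather than from inside a rebalance callback.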

Should this work? Why would kafka-python be trying to commit the offset for a paused partition? What would be causing the RequestTimedOutError? Is there an alternative way to accomplish what I describe above? (I’m more than happy to try an alternative approach.)

Issue Analytics

Created 7 years ago
Comments: 7 (3 by maintainers)

Top GitHub Comments

hburrows commented, Nov 12, 2016

I think we’ve figured this out, though I’m not sure whether you’d consider it a bug. We were performing long-running “batch” operations inside on_partitions_assigned of our ConsumerRebalanceListener subclass. The first call to on_partitions_assigned happens on the first poll, apparently before the HeartbeatTask is scheduled to run; I don’t believe the heartbeat task gets scheduled until after on_partitions_assigned returns. Thus, even though we poll, no heartbeats are ever sent.

Our solution is to remove any long-running “batch” operations from the ConsumerRebalanceListener callbacks and run them from within the main poll loop. The callbacks now only set a flag, which the main poll loop inspects to decide whether a “batch” operation is needed. This allows any internal “housekeeping” associated with a rebalance to finish before the “batch” job runs. To keep the heartbeat active during a long-running batch we 1) pause all partitions and then 2) regularly call poll with a timeout_ms of 100 (a timeout_ms of 0, or a value that is too small, doesn’t seem to result in a heartbeat request). We can also abort batch jobs, so if a rebalance happens during a batch it’s immediately aborted and restarted. Generally seems to be working.
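A minimal sketch of that arrangement might look like the following. The names `FlagOnAssign` and `run_batch`, the list of `steps` callables, and the choice of `threading.Event` as the flag are assumptions made for illustration; the original code was not posted. `ConsumerRebalanceListener`, `assignment`, `pause`, `resume`, and `poll(timeout_ms=...)` are part of kafka-python.

```python
import threading
import time

try:
    from kafka import ConsumerRebalanceListener  # kafka-python
except ImportError:
    # Stand-in base class, so the sketch can run without kafka-python.
    ConsumerRebalanceListener = object


class FlagOnAssign(ConsumerRebalanceListener):
    """Records that a rebalance happened; does no long-running work itself."""

    def __init__(self):
        self.batch_needed = threading.Event()

    def on_partitions_revoked(self, revoked):
        pass  # nothing to do in this sketch

    def on_partitions_assigned(self, assigned):
        self.batch_needed.set()


def run_batch(consumer, listener, steps, keepalive_secs=2.0, poll_timeout_ms=100):
    """Run each callable in `steps` with all partitions paused.

    Returns True if every step ran, or False if a rebalance was flagged
    part-way through, in which case the caller is expected to start over.
    """
    listener.batch_needed.clear()
    paused = consumer.assignment()
    consumer.pause(*paused)
    last_poll = time.monotonic()
    try:
        for step in steps:
            step()
            if time.monotonic() - last_poll >= keepalive_secs:
                # Partitions are paused, so no records are expected here.
                consumer.poll(timeout_ms=poll_timeout_ms)
                last_poll = time.monotonic()
                if listener.batch_needed.is_set():
                    return False  # rebalanced mid-batch: abort
        return True
    finally:
        # Only resume partitions that are still assigned after any rebalance.
        current = consumer.assignment()
        consumer.resume(*[tp for tp in paused if tp in current])
```

In the main loop, the idea would be to call poll as usual, check `listener.batch_needed.is_set()` after each poll, and call `run_batch` repeatedly until it returns True. The batch has to be split into steps short enough that the keepalive poll is reached on schedule.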

How kafka-python decides to send a heartbeat is a mystery and very complicated. There should be a predictable and transparent way to get a heartbeat request transmitted. The heartbeat is so integral to automatic partition assignment that you need to understand how to make it happen predictably.

dpkp commented, Nov 18, 2016

Glad to hear you found the error. Heartbeat management is definitely convoluted. I copied the initial Java client design, but the Java client has since changed its approach and now uses a separate background thread to manage heartbeats. It’s probably worth investigating a similar switch in kafka-python.
