RequestTimedOutError when polling with all partitions paused
When I pause all my partitions and call poll(), I get a RequestTimedOutError after 40 seconds (presumably due to request.timeout.ms). This is then shortly followed by an “Auto offset commit failed for group XYZ...” error. I’m pausing all my partitions so I can perform a long “batch” operation and continue to poll the topic for heartbeat purposes, so Kafka doesn’t think I’m dead, but NOT actually fetch any messages. Once the batch operation is done I’ll resume the partitions. I’m essentially trying to implement a background heartbeat by interleaving a “no-op” poll into the batch operation on a regular interval (every 2 seconds).
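The interleaving described here could look roughly like the sketch below. It is a minimal illustration, not the code from the original report: run_batch, handle, and the injectable clock are hypothetical names, and the consumer is assumed to be a kafka-python KafkaConsumer (only its assignment(), pause(), resume(), paused(), and poll() methods are used). Whether these polls actually produce heartbeats depends on where the function is called from, as the follow-up comment explains.

```python
import time


def run_batch(consumer, work_items, handle, poll_interval_s=2.0,
              timeout_ms=100, clock=time.monotonic):
    """Process work_items, interleaving a no-op poll every poll_interval_s seconds.

    Hypothetical helper: `consumer` is assumed to behave like a kafka-python
    KafkaConsumer that has already joined its group.
    """
    last_poll = clock()
    try:
        for item in work_items:
            handle(item)
            if clock() - last_poll >= poll_interval_s:
                # Re-pause before every poll so that partitions assigned
                # since the last poll are paused as well.
                consumer.pause(*consumer.assignment())
                # With every partition paused, poll() is expected to return
                # no records; it is called only for its group housekeeping.
                consumer.poll(timeout_ms=timeout_ms)
                last_poll = clock()
    finally:
        # Resume only what is still paused (and therefore still assigned).
        consumer.resume(*consumer.paused())
```

The helper resumes via paused() rather than the assignment captured at the start, on the assumption that a rebalance during the batch may have changed which partitions this consumer owns.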
Why am I doing this? I want to load process state from a compacted topic during process startup. I also want to use Kafka’s automatic partition assignment. I use a consistent hash key for all my topics, so the data in my partitions aligns from topic to topic. When a process starts, it connects to a topic and gets its partition assignments. I then load state from the compacted topic by manually assigning the partitions, using the partition numbers that were automatically assigned for the main input topic. I’ll read messages from the compacted topic up to a predetermined offset and call poll() on the main topic on a regular 2-second interval. This seems like it should work. However, after 40 seconds I get RequestTimedOutError, closing connection, and similar errors.
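A sketch of the state-loading step described above follows. The names state_partitions, load_state, apply, heartbeat, and stop_offsets are hypothetical, and the state consumer is assumed to be a second kafka-python KafkaConsumer that is manually assigned rather than subscribed. Progress is tracked with position() instead of record offsets because a compacted topic can have gaps in its offsets. The fallback TopicPartition exists only so the sketch runs without kafka-python installed.

```python
from collections import namedtuple

try:
    from kafka import TopicPartition  # the real kafka-python type, if installed
except ImportError:
    # Stand-in with the same fields, used only to keep this sketch runnable.
    TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])


def state_partitions(main_assignment, state_topic):
    """Mirror the main topic's assigned partition numbers onto the state topic."""
    return [TopicPartition(state_topic, tp.partition)
            for tp in sorted(main_assignment)]


def load_state(state_consumer, partitions, stop_offsets, apply, heartbeat,
               timeout_ms=500):
    """Replay each partition from the beginning up to stop_offsets[tp] (exclusive).

    `heartbeat` is a caller-supplied callable, for example a no-op poll on
    the main consumer. Assumes every stop offset is eventually reachable.
    """
    state_consumer.assign(partitions)
    state_consumer.seek_to_beginning(*partitions)

    def unfinished(candidates):
        return {tp for tp in candidates
                if state_consumer.position(tp) < stop_offsets[tp]}

    pending = unfinished(partitions)
    while pending:
        batch = state_consumer.poll(timeout_ms=timeout_ms)
        for tp, records in batch.items():
            for record in records:
                # Ignore anything at or beyond the predetermined offset.
                if record.offset < stop_offsets[tp]:
                    apply(record)
        pending = unfinished(pending)
        heartbeat()
```

Here heartbeat is called once per loop iteration; a caller wanting the 2-second cadence mentioned above could rate-limit inside that callable.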
Should this work? Why would kafka-python be trying to commit the offset for a paused partition? What would be causing the RequestTimedOutError? Is there an alternative way that I can accomplish what I describe above? (I’m more than happy to try an alternative approach.)
Issue Analytics
- State:
- Created 7 years ago
- Comments: 7 (3 by maintainers)
I think we’ve figured this out. Not sure if you’d consider it a bug or not. We were performing long-running “batch” operations inside on_partitions_assigned of our ConsumerRebalanceListener subclass. The first call to on_partitions_assigned happens on the first poll and apparently before the HeartbeatTask is scheduled to run; I don’t believe the heartbeat task gets scheduled until after on_partitions_assigned returns. Thus, as we poll from inside that callback, no heartbeats are ever sent.
Our solution is to remove any long-running “batch” operations from the ConsumerRebalanceListener callbacks and run them from within the main poll loop. The ConsumerRebalanceListener callbacks now set a flag that the main poll loop inspects to signal a “batch” operation. This allows any internal “housekeeping” associated with a rebalance to occur before we attempt to run the “batch” job. To keep the heartbeat active during a long-running batch we 1) pause all partitions and then 2) regularly call poll() with a timeout_ms of 100 (a timeout_ms of 0, or a value that is too small, doesn’t appear to result in a heartbeat request, and we’re not sure why). We can also abort batch jobs, so if a rebalance happens during a batch it’s immediately aborted and restarted. This generally seems to be working.
How kafka-python decides to send a heartbeat is a mystery and very complicated. There should be a predictable and transparent way to get a heartbeat request transmitted. The heartbeat is so integral to automatic partition assignment that you need to understand how to make it happen predictably.
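The flag-based workaround described in this comment might be sketched as follows. This is an illustration under assumptions, not the commenter’s actual code: BatchFlagListener, run_batch, poll_loop, make_steps, handle_records, and should_stop are hypothetical names, and the generation counter is one possible way to detect a rebalance that happens mid-batch. The stand-in base class exists only so the sketch runs without kafka-python installed.

```python
try:
    from kafka import ConsumerRebalanceListener
except ImportError:
    class ConsumerRebalanceListener:  # stand-in so the sketch runs without kafka-python
        pass


class BatchFlagListener(ConsumerRebalanceListener):
    """Only records that an assignment happened; does no long-running work."""

    def __init__(self):
        self.generation = 0       # bumped on every assignment
        self.batch_needed = False

    def on_partitions_revoked(self, revoked):
        pass

    def on_partitions_assigned(self, assigned):
        self.generation += 1
        self.batch_needed = True


def run_batch(consumer, listener, steps, timeout_ms=100):
    """Run steps with all partitions paused; return False if a rebalance interrupts."""
    started_in = listener.generation
    for step in steps:
        step()
        consumer.pause(*consumer.assignment())
        # May invoke the listener. Any records returned here are dropped,
        # which real code would need to guard against.
        consumer.poll(timeout_ms=timeout_ms)
        if listener.generation != started_in:
            return False  # abort; the listener has set batch_needed again
    consumer.resume(*consumer.paused())
    return True


def poll_loop(consumer, listener, make_steps, handle_records, should_stop):
    """Main loop: run the batch when flagged, otherwise consume normally."""
    while not should_stop():
        if listener.batch_needed:
            listener.batch_needed = False
            run_batch(consumer, listener, make_steps())
        else:
            handle_records(consumer.poll(timeout_ms=1000))
```

The listener would be registered with something like consumer.subscribe(topics=[...], listener=listener). Because an aborted batch leaves batch_needed set by the listener, the next loop iteration restarts the batch from the beginning.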
Glad to hear you found the error. Heartbeat management is definitely convoluted. I copied the initial Java client design, but the Java client has since switched to a separate background thread to manage heartbeats. It’s probably worth investigating a similar switch in kafka-python.