RequestTimedOutError when polling with all partitions paused

When I pause all my partitions and call poll, I get a RequestTimedOutError after 40 seconds (presumably due to request.timeout.ms). This is shortly followed by an “Auto offset commit failed for group XYZ...” error. I’m pausing all my partitions so that I can perform a long “batch” operation while continuing to poll the topic for heartbeat purposes, so that Kafka doesn’t think I’m dead, but without actually fetching any messages. Once the batch operation is done, I’ll resume the partitions. Essentially, I’m trying to implement a background heartbeat by interleaving a “no-op” poll into the batch operation at a regular interval (every 2 seconds).

Why am I doing this? I want to load process state from a compacted topic during process startup, and I also want to use Kafka’s automatic partition assignment. I use a consistent hash key for all my topics, so the data in my partitions aligns from topic to topic. When a process starts, it subscribes to the main input topic and gets its partition assignments. I then load state from the compacted topic by manually assigning the partitions whose numbers match those that were automatically assigned from the main input topic. I read messages from the compacted topic up to a predetermined offset and call poll on the main topic at a regular 2-second interval. This seems like it should work. However, after 40 seconds I get RequestTimedOutError, “closing connection”, and similar errors.
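The startup procedure described above could be sketched roughly as follows. This is a minimal sketch, not the poster’s actual code: `load_state`, `stop_offsets`, `apply_record`, and `heartbeat_interval_s` are hypothetical names, and both consumers are assumed to be kafka-python `KafkaConsumer` instances (the main one subscribed with automatic assignment, the state one created without a subscription). It does not handle a rebalance that changes the assignment mid-load.

```python
import time

try:
    from kafka import TopicPartition  # kafka-python
except ImportError:
    # Stand-in with the same fields, only so the sketch runs without the library.
    from collections import namedtuple
    TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])


def load_state(main_consumer, state_consumer, state_topic, stop_offsets,
               apply_record, heartbeat_interval_s=2.0):
    """Replay `state_topic` for the partition numbers assigned on the main topic.

    `stop_offsets` (hypothetical input) maps partition number -> first offset
    that should NOT be read.
    """
    assigned = set(main_consumer.assignment())
    if not assigned:
        return
    main_consumer.pause(*assigned)  # keep polling without fetching records

    # Same partition numbers, but on the compacted state topic.
    state_tps = [TopicPartition(state_topic, tp.partition) for tp in assigned]
    state_consumer.assign(state_tps)
    state_consumer.seek_to_beginning(*state_tps)

    # Partitions with nothing to read are done immediately.
    pending = {tp for tp in state_tps if stop_offsets.get(tp.partition, 0) > 0}
    last_poll = time.monotonic()
    try:
        while pending:
            for tp, records in state_consumer.poll(timeout_ms=500).items():
                stop = stop_offsets.get(tp.partition, 0)
                for record in records:
                    if record.offset < stop:
                        apply_record(record)
                    # Compaction leaves offset gaps, so compare with >=.
                    if record.offset >= stop - 1:
                        pending.discard(tp)
            if time.monotonic() - last_poll >= heartbeat_interval_s:
                main_consumer.poll(timeout_ms=100)  # "no-op" poll on the main topic
                last_poll = time.monotonic()
    finally:
        # Resume only partitions that are still assigned to this consumer.
        main_consumer.resume(*(assigned & set(main_consumer.assignment())))
```

Whether the interleaved `poll` actually results in a heartbeat depends on the kafka-python version and on where `load_state` is called from, which is what the rest of this thread is about.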

Should this work? Why would kafka-python try to commit the offset for a paused partition? What would cause the RequestTimedOutError? Is there an alternative way to accomplish what I describe above? (I’m more than happy to try a different approach.)

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

hburrows commented, Nov 12, 2016 (2 reactions)

I think we’ve figured this out, though I’m not sure whether you’d consider it a bug. We were performing long-running “batch” operations inside on_partitions_assigned of our ConsumerRebalanceListener subclass. The first call to on_partitions_assigned happens on the first poll and apparently before the HeartbeatTask is scheduled to run; I don’t believe the heartbeat task gets scheduled until after on_partitions_assigned returns. As a result, no heartbeats are ever sent, even though we keep polling.

Our solution is to remove any long-running “batch” operations from the ConsumerRebalanceListener callbacks and run them from within the main poll loop. The callbacks now only set a flag, which the main poll loop inspects to decide whether a “batch” operation is needed. This allows any internal “housekeeping” associated with a rebalance to complete before we attempt to run the “batch” job. To keep the heartbeat active during a long-running batch we 1) pause all partitions and then 2) regularly call poll with a timeout_ms of 100 (a timeout_ms of 0, or a value that is too small, doesn’t seem to result in a heartbeat request ???). We can also abort batch jobs, so if a rebalance happens during a batch, the batch is immediately aborted and restarted. This generally seems to be working.
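A minimal sketch of that arrangement might look like the following. The names `BatchFlagListener`, `run_batch`, `poll_loop`, `make_steps`, `handle_record`, and `keep_running` are hypothetical, and the batch is assumed to be split into short steps so that poll can be called between them. It is an illustration of the flag-and-pause idea described above, not the code we actually run.

```python
try:
    from kafka.consumer.subscription_state import ConsumerRebalanceListener
except ImportError:
    # Stand-in base class, only so the sketch runs without kafka-python.
    ConsumerRebalanceListener = object


class BatchFlagListener(ConsumerRebalanceListener):
    """Only records that a rebalance happened; no long work in the callbacks."""

    def __init__(self):
        self.batch_needed = False

    def on_partitions_revoked(self, revoked):
        pass

    def on_partitions_assigned(self, assigned):
        self.batch_needed = True


def run_batch(consumer, listener, steps, poll_timeout_ms=100):
    """Run `steps` with all partitions paused, polling between steps.

    Returns False if a rebalance interrupted the batch, True otherwise.
    """
    listener.batch_needed = False
    paused = set(consumer.assignment())
    consumer.pause(*paused)
    try:
        for step in steps:
            step()
            consumer.poll(timeout_ms=poll_timeout_ms)  # paused, so no records
            if listener.batch_needed:  # rebalance during the batch: abort
                return False
        return True
    finally:
        # Resume only partitions that are still assigned after any rebalance.
        consumer.resume(*(paused & set(consumer.assignment())))


def poll_loop(consumer, listener, make_steps, handle_record, keep_running):
    """Main loop: run the batch when flagged, otherwise process records."""
    while keep_running():
        if listener.batch_needed:
            # An aborted batch leaves the flag set, so it restarts next time.
            run_batch(consumer, listener, make_steps())
            continue
        for records in consumer.poll(timeout_ms=1000).values():
            for record in records:
                handle_record(record)
```

The listener would be passed to `consumer.subscribe(topics, listener=listener)`. One caveat of this sketch: a poll that triggers a rebalance may also return records, and those are handled before the batch runs, which may or may not be acceptable for a given application.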

How kafka-python decides when to send a heartbeat is complicated and, to me, still a mystery. There should be a predictable and transparent way to get a heartbeat request transmitted. The heartbeat is so integral to automatic partition assignment that you need to understand how to make it happen predictably.

dpkp commented, Nov 18, 2016 (0 reactions)

Glad to hear you found the error. Heartbeat management is definitely convoluted. I copied the initial Java client design, but the Java client has since changed its approach and now uses a separate background thread to manage heartbeats. It’s probably worth investigating a similar switch in kafka-python.
