No exception raised when heartbeat timed out

Our team is migrating an existing project from pika 0.12.0 to the latest release, 1.2.0. The project relies on the heartbeat timeout exception raised by pika to detect a network disconnection. This works with 0.12.0 but not with 1.2.0, where the exception no longer seems to be raised when the network is disconnected.

A code snippet used to reproduce this issue is as follows:


import json
import time

import pika

credentials = pika.PlainCredentials(<username>, <password>)
params = pika.ConnectionParameters(<server-ip>, <server-port>, '/', credentials, heartbeat=60)
data = json.dumps(<some-data>)
conn = pika.BlockingConnection(params)
chan = conn.channel()
chan.exchange_declare(
    exchange=<exchange-name>,
    exchange_type='topic',
    durable=False,
    auto_delete=True
)

while True:
    time.sleep(5)
    print('publishing...')
    chan.basic_publish(
        exchange=<exchange-name>,
        routing_key=<routing-key>,
        body=data
    )

We created separate virtual environments, installed a different version of pika in each, ran the same snippet, and then manually disconnected the network (by unplugging the Ethernet cable) after the loop had started (i.e. after the first print line appeared). The results differ: with 0.12.0, pika raises an exception that breaks the loop and terminates the program, which is what we want. With 1.2.0, the loop appears to run forever; basic_publish() is called again and again, but we believe it no longer has any effect.

We inspected the 1.2.0 source code and found that the HeartbeatChecker does detect the idle connection, but we cannot figure out why no exception is raised.

Our RabbitMQ server is running version 3.6.10 with Erlang 20.2.2.

Is this an intentional behavior change between 0.12.0 and 1.2.0, or a bug introduced by accident?

Any input is welcome!

Thanks to all contributors for your great efforts!

vitalis89 commented, Nov 15, 2021

I’m pretty sure I encountered the same problem. After analyzing the code, I see that once the HeartbeatChecker aborts the connection, the _flush_output method of BlockingConnection stops running callbacks because of the following:

        # Conditions for terminating the processing loop:
        #   connection closed
        #         OR
        #   empty outbound buffer and no waiters
        #         OR
        #   empty outbound buffer and any waiter is ready
        is_done = (lambda:
                       self._closed_result.ready or
                   ((not self._impl._transport or
                     self._impl._get_write_buffer_size() == 0) and
                    (not waiters or any(ready() for ready in waiters))))

Because the checker has aborted the connection, no more data accumulates in the outbound buffer, so self._impl._get_write_buffer_size() == 0 evaluates to True. The waiters list also contains only the lambda: True function (_ALWAYS_READY_WAITERS). As a result, is_done() is already true and the body of the while loop never runs:

        # Process I/O until our completion condition is satisfied
        while not is_done():
            self._impl.ioloop.poll()
            self._impl.ioloop.process_timeouts()

process_timeouts is responsible for popping callbacks from the _callbacks queue, so the relevant callback, _connection_lost_notify_async, never runs.

A possible solution is to issue an additional call to process_timeouts right after the while loop. It’s not the best solution, though; this specific scenario can probably be identified so that unnecessary calls are avoided.
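
To illustrate the reasoning, here is a toy model of the loop described in this comment. It is not pika's code: ToyLoop, flush_output, and demo are hypothetical names, and the model only mimics the shape of the logic (a queue of callbacks that is drained only by process_timeouts, and a completion check that is already true on entry).

```python
from collections import deque


class ToyLoop:
    """Hypothetical stand-in for the I/O loop; not pika's implementation."""

    def __init__(self):
        self.callbacks = deque()

    def poll(self):
        # No socket activity in this toy model.
        pass

    def process_timeouts(self):
        # Drain and run every queued callback.
        while self.callbacks:
            self.callbacks.popleft()()


def flush_output(loop, is_done, extra_tick=False):
    # Same shape as the loop quoted above: it only ticks while not done.
    while not is_done():
        loop.poll()
        loop.process_timeouts()
    if extra_tick:
        # The workaround suggested above: one more pass after the loop.
        loop.process_timeouts()


def demo(extra_tick):
    loop = ToyLoop()
    state = {"lost_notified": False}
    # Pretend the heartbeat checker queued a "connection lost" callback.
    loop.callbacks.append(lambda: state.update(lost_notified=True))
    # Empty buffer plus an always-ready waiter means is_done() is True at once.
    flush_output(loop, is_done=lambda: True, extra_tick=extra_tick)
    return state["lost_notified"]
```

In this model, demo(False) leaves the queued callback unexecuted, while demo(True) runs it. Whether the same one-line change is appropriate inside pika itself is a separate question that the maintainers would need to judge.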

8fdafs2 commented, Jul 5, 2021

I have narrowed down the change that introduced this behavior to the major update from 0.13.1 to 1.0.0.

It seems the callbacks (one of which raises the wanted exception) are not executed unless the underlying asynchronous I/O loop is made to tick by calling BlockingChannel.start_consuming(), BlockingConnection.process_data_events(), or BlockingConnection.sleep().
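
Based on that observation, one way to adapt the publishing loop from the original report might be to replace time.sleep() with BlockingConnection.sleep(), so that the I/O loop gets a chance to run between publishes. The sketch below is untested against a real broker. The helper name publish_with_io_tick and its max_messages parameter are hypothetical, and conn and chan are assumed to be a pika BlockingConnection and its channel.

```python
def publish_with_io_tick(conn, chan, exchange, routing_key, body,
                         interval=5.0, max_messages=None):
    """Publish repeatedly, letting the connection process I/O between sends.

    Assumes conn is a pika BlockingConnection and chan is one of its
    channels. max_messages is a hypothetical stop condition added so the
    loop can be bounded; None keeps the endless loop of the original.
    """
    sent = 0
    while max_messages is None or sent < max_messages:
        # conn.sleep() services the I/O loop while waiting, unlike
        # time.sleep(), so pending connection callbacks get to run.
        conn.sleep(interval)
        chan.basic_publish(exchange=exchange,
                           routing_key=routing_key,
                           body=body)
        sent += 1
    return sent
```

If the comment above is right, a lost connection should then surface as an exception from conn.sleep() or from the next basic_publish() call, which the caller can catch to stop or reconnect.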
