Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

When RabbitMQ is down, calling a task is stuck forever (retry settings are ignored)

See original GitHub issue

#735 was closed, but the problem I mentioned in https://github.com/celery/kombu/issues/735#issuecomment-327157158 was not resolved.

Recap: if you try to call a task while RabbitMQ is down, the call gets stuck forever, ignoring my retry_policy or retry=False.

Here’s a minimal Python 2.7 example to reproduce the issue:

# tasks.py
import time

from celery import Celery

app = Celery('tasks', broker='amqp://guest@localhost//')

@app.task
def some_long_computation():
    print('In some_long_computation, about to sleep for 2 seconds')
    time.sleep(2)
    print('Exiting some_long_computation')

if __name__ == '__main__':
    print('>>> About to call some_long_computation task')
    # Disable retry:
    some_long_computation.apply_async(retry=False)
    # Or with a retry policy:
    # some_long_computation.apply_async(retry=True, retry_policy={
    #     'max_retries': 3, 'interval_start': 0, 'interval_step': 0.2, 'interval_max': 0.2})
    print('>>> After calling some_long_computation task')

Run the worker:

celery -A tasks worker --loglevel=info

Execute the task in another shell/session:

python tasks.py

The task will complete and exit successfully.

Now, stop RabbitMQ using:

sudo service rabbitmq-server stop

Execute the task again, and you will see that it is stuck, even though the call passed retry=False and should have raised an exception.
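
For clarity, here is a sketch of the behavior the report expects. The exception type is an assumption based on Celery's documented apply_async behavior (a kombu.exceptions.OperationalError when the transport cannot be reached), not something this reproduction actually produces:

# expected.py (hypothetical) - what the call above should do with retry=False:
# fail fast instead of blocking forever while the broker is unreachable.
from kombu.exceptions import OperationalError

from tasks import some_long_computation

try:
    some_long_computation.apply_async(retry=False)
except OperationalError as exc:
    print('Broker is down and retries are disabled: %r' % exc)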


I re-tested with Celery and kombu 4.2.1 and 4.2.0; in both versions the call remained stuck and did not respect retry=False.

Supplying a retry policy also didn’t work (see the commented-out example in the code above).

I tested this with Python 2.7.15 on Ubuntu 18.04.1. Here is the pip freeze output from my test environment:

amqp==2.3.2
billiard==3.5.0.4
celery==4.2.1
kombu==4.2.1
pytz==2018.5
vine==1.1.4

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 4
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

1 reaction
srikiraju commented, Jan 28, 2019

Bump on this issue. This is fairly important, to avoid blowing up everyone's servers in the event of some intermittent downtime. Luckily, we caught this in a unit test.
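
One way to catch the hang in a test is to publish from a background thread and fail if the call does not return within a deadline. This is only a sketch (not the commenter's actual test), reusing the tasks.py module from the report and assuming the broker is stopped while it runs:

# test_publish.py (hypothetical): run with RabbitMQ stopped.
import threading

from tasks import some_long_computation

def test_publish_does_not_hang_when_broker_is_down():
    publisher = threading.Thread(
        target=lambda: some_long_computation.apply_async(retry=False))
    publisher.daemon = True  # don't keep the test process alive if it hangs
    publisher.start()
    publisher.join(timeout=5)
    # If the thread is still alive, apply_async is stuck instead of raising.
    assert not publisher.is_alive(), 'apply_async hung while the broker was down'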

1 reaction
etiktin commented, Jul 30, 2018

Note that the original monkey patch I created at https://github.com/celery/kombu/issues/735#issuecomment-327595747 still works, but you need to update the kombu version check.

Here’s the updated code:

def monkey_patch_kombu(retry=True, retry_policy=None):
    """
    Applies a fix for producer being stuck forever when trying to publish a
    message. See details at: https://github.com/celery/kombu/issues/735 and 
    https://github.com/celery/kombu/issues/902
    :param bool retry: decides if publishing messages will be retried in the 
    case of connection loss or other connection errors (see 
    `task_publish_retry` in Celery's docs)
    :param dict retry_policy: defines the default policy when retrying
    publishing a task message in the case of connection loss or other
    connection errors (see `task_publish_retry_policy` in Celery's docs)
    """
    import kombu
    assert kombu.__version__ == '4.2.1', 'Check if patch is still needed'
    from kombu import Connection

    if retry_policy is None:
        retry_policy = dict(max_retries=4, interval_start=0,
            interval_step=10, interval_max=10)

    if not retry or retry_policy['max_retries'] == 0:
        # Disable retries
        # Note: we use -1 instead of 0, because the retry logic in
        # kombu/utils/functional.py `retry_over_time` function checks if
        # max_retries is "truthy" before checking if the current number of
        # retries passed max_retries, so 0 won't work, but -1 will
        retry_policy['max_retries'] = -1

    @property
    def patched_default_channel(self):
        self.ensure_connection(**retry_policy)

        if self._default_channel is None:
            self._default_channel = self.channel()
        return self._default_channel

    # Patch/replace the connection module default_channel property
    Connection.default_channel = patched_default_channel

Usage: call this function once, before publishing any tasks, and your retry policy should be respected (if none is provided, the default policy defined in the function is used).
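
For example, reusing tasks.py from the report, a hypothetical call site could look like the sketch below. The patches module name is made up (import monkey_patch_kombu from wherever you saved it), and catching OperationalError is an assumption about what surfaces once the configured retries are exhausted:

# publish.py (hypothetical)
from kombu.exceptions import OperationalError

from tasks import some_long_computation
# Hypothetical module name: wherever you saved the function above.
from patches import monkey_patch_kombu

monkey_patch_kombu(retry=True, retry_policy={
    'max_retries': 3, 'interval_start': 0,
    'interval_step': 0.2, 'interval_max': 0.2})

try:
    some_long_computation.apply_async()
except OperationalError as exc:
    print('Gave up publishing after the configured retries: %r' % exc)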


Top Results From Across the Web

Handling long running tasks in pika / RabbitMQ - Stack Overflow
For now, your best bet is to turn off heartbeats, this will keep RabbitMQ from closing the connection if you're blocking for too...

Reliability Guide - RabbitMQ
Reliability Guide. Overview. This guide provides an overview of the features of RabbitMQ, AMQP 0-9-1 and other supported protocols related to data safety.

What's new in Celery 4.0 (latentcall)
The celery worker command now ignores the --no-execv, --force-execv, and the CELERYD_FORCE_EXECV setting. This flag will be removed completely in 5.0...

Celery Documentation - Read the Docs
A task queue's input is a unit of work called a task. ... latency (using RabbitMQ, librabbitmq, and optimized settings). • Flexible.

All about of RabbitMQ Consumer-Side Failover | by Thanh Trinh
Stateless retry interceptor: Which will do all retries within the thread (using Thread.sleep()) without rejecting the message on each retry (so ...
