Kombu may crash while running BRPOP on a connection that Redis decided to close

Versions

Kombu: 4.0.2
Celery: 4.0.2
redis-py: 2.10.5
Redis: 3.2.6
Python: 2.7.12

Steps to reproduce

Start a Redis server with timeout 1 (e.g. redis-server ./redis.conf); I use the following config:

daemonize yes
pidfile ./redis.pid
port 0
unixsocket /tmp/celery.redis.test.sock
unixsocketperm 755
timeout 1
loglevel notice
logfile ./redis.log
databases 1

(I shove the unix socket into /tmp/ because there’s a limit to how long a unix socket path can be.)

I put the Celery configuration in celeryconfig.py:

CELERY_BROKER_URL = 'redis+socket:///tmp/celery.redis.test.sock'
CELERY_RESULT_BACKEND = CELERY_BROKER_URL

The Celery app is in tasks.py:

import sys
import time

from celery import Celery, Task

app = Celery('tasks')
app.config_from_object('celeryconfig', namespace='CELERY')

@app.task
def add(x, y):
    return x + y

I have test.py:

from celery.bin.multi import MultiTool

from tasks import app, add

workers = [
    "A",
    "B"
]

while True:
    for worker in workers:
        retcode = MultiTool().execute_from_commandline(["multi", "start", "-A",
                                                        "tasks", worker])
        print "STARTED {0} WITH {1}".format(worker, retcode)

    print "PING AFTER START", app.control.inspect().ping()
    print add.delay(1, 2).get()
    print "PING AFTER TASK", app.control.inspect().ping()
    for worker in workers:
        retcode = MultiTool().execute_from_commandline(["multi", "stopwait", "-A",
                                                        "tasks", worker])
        print "STOPPED {0} WITH {1}".format(worker, retcode)
    print "PING AFTER STOP", app.control.inspect().ping()

Run python test.py.

Expected Results

Should run forever without a crash.

Actual Results

Crashes at the end of the first iteration:

PING AFTER STOP
Traceback (most recent call last):
  File "test.py", line 24, in <module>
    print "PING AFTER STOP", app.control.inspect().ping()
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/celery/app/control.py", line 113, in ping
    return self._request('ping')
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/celery/app/control.py", line 81, in _request
    timeout=self.timeout, reply=True,
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/celery/app/control.py", line 436, in broadcast
    limit, callback, channel=channel,
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/pidbox.py", line 321, in _broadcast
    channel=chan)
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/pidbox.py", line 360, in _collect
    self.connection.drain_events(timeout=timeout)
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/transport/virtual/base.py", line 961, in drain_events
    get(self._deliver, timeout=timeout)
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 352, in get
    self._register_BRPOP(channel)
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 301, in _register_BRPOP
    channel._brpop_start()
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 707, in _brpop_start
    self.client.connection.send_command('BRPOP', *keys)
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/redis/connection.py", line 556, in send_packed_command
    (errno, errmsg))
redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.

Observations

The code above in test.py imitates a sequence of operations that happen while testing an actual Django application of mine. The test suite starts and stops workers and executes some tasks on them.

Though the problem surfaced after I upgraded to Celery 4.x and thus started using Kombu 4.x, a summary inspection of Kombu’s source code in the 3.x series suggests the problem is present there too. It is unclear to me why I did not run into it when running Kombu 3.x.

In my actual Redis setup I do not use a timeout 1 setting; timeout 1 is simply an easy way to make Redis close a connection. A connection may also be closed because the tcp-keepalive logic deemed it “dead”, because a client violated an output buffer limit, or for some other reason. What is clear is that Redis clients should be resilient in the face of connections closed by the server.

Indeed, the code of redis-py will generally retry sending commands that fail. This can be seen in execute_command:

    def execute_command(self, *args, **options):
        "Execute a command and return a parsed response"
        pool = self.connection_pool
        command_name = args[0]
        connection = pool.get_connection(command_name, **options)
        try:
            connection.send_command(*args)
            return self.parse_response(connection, command_name, **options)
        except (ConnectionError, TimeoutError) as e:
            connection.disconnect()
            if not connection.retry_on_timeout and isinstance(e, TimeoutError):
                raise
            connection.send_command(*args)
            return self.parse_response(connection, command_name, **options)
        finally:
            pool.release(connection)

Kombu’s Redis Channel generally benefits from redis-py’s automatic retrying because most of the methods it calls on its client ultimately go through execute_command. However, _brpop_start calls self.client.connection.send_command('BRPOP', *keys) directly on the connection. If this call fails, the exception propagates up the stack without a retry.

I’ve been able to work around the issue by changing the code of _brpop_start to:

    def _brpop_start(self, timeout=1):
        queues = self._queue_cycle.consume(len(self.active_queues))
        if not queues:
            return
        keys = [self._q_for_pri(queue, pri) for pri in self.priority_steps
                for queue in queues] + [timeout or 0]
        self._in_poll = self.client.connection
        from redis.exceptions import ConnectionError
        try:
            self.client.connection.send_command('BRPOP', *keys)
        except ConnectionError:
            # The failed write disconnected the socket, so retrying the
            # command reconnects and re-sends BRPOP.
            self.client.connection.send_command('BRPOP', *keys)
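For reference, the same workaround can be applied without editing the installed package, by wrapping the method at worker startup. This is only a sketch: it assumes kombu 4.x (where Channel._brpop_start has the shape shown above) and that re-running the whole method after a failed send is harmless.

# Sketch: monkey-patch kombu's Channel._brpop_start to retry once on a
# closed connection. redis-py disconnects the socket when the write
# fails, so the second attempt re-sends BRPOP over a fresh connection.
from redis.exceptions import ConnectionError

from kombu.transport import redis as kombu_redis

_original_brpop_start = kombu_redis.Channel._brpop_start

def _brpop_start_with_retry(self, *args, **kwargs):
    try:
        return _original_brpop_start(self, *args, **kwargs)
    except ConnectionError:
        return _original_brpop_start(self, *args, **kwargs)

kombu_redis.Channel._brpop_start = _brpop_start_with_retry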

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 4
  • Comments: 14 (3 by maintainers)

Top GitHub Comments

1 reaction
lddubeau commented, Mar 19, 2019

I revisited this issue recently and spent quite a bit of time re-examining the problem. The BRPOP failure I experienced is really one manifestation of a larger problem:

  1. Redis may close connections unilaterally.

  2. redis-py sometimes reopens the connection automatically.

  3. On those occasions when redis-py does not reopen the connection, a subsequent attempt to use it fails with EPIPE, and the resulting exception is reported up the call stack.

My initial issue was with BRPOP but the problem can happen whenever Celery (through Kombu) tries to contact Redis. My recent brush with this problem actually happened while Celery (in Celery code proper, not through Kombu) was trying to subscribe to a pubsub channel on Redis: the connection had been closed and the subscription failed. Retrying the subscription after the failure worked just fine.
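The failure mode is easy to see with redis-py alone. A minimal sketch, assuming a local Redis running with the timeout 1 configuration from the original report (the socket path is the one used there):

# Send a command on a raw connection, the way _brpop_start does,
# bypassing execute_command() and its retry logic.
import time

import redis

client = redis.StrictRedis(unix_socket_path='/tmp/celery.redis.test.sock')
conn = client.connection_pool.get_connection('PING')

conn.send_command('PING')
print(conn.read_response())   # 'PONG'

time.sleep(2)                 # let the server-side idle timeout expire

# The server has closed its end of the socket; this raw send typically
# fails with redis.exceptions.ConnectionError (errno 32, broken pipe).
conn.send_command('PING')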

When I filed my initial report, redis-py was coded in a way that made execute_command almost always retry failed attempts at sending a command to Redis. It turns out that the retry logic in redis-py was wrong and was only fixed earlier this year. When I ran my tests with the redis-py code as it existed when I filed this issue, I could see retries happening regularly. Prior to the fix, if you used redis-py in its default configuration and the retry code faced a closed connection, it would always automatically reopen it. Now that the retry logic has been fixed, if you use redis-py in its default configuration and the retry code faces a closed connection, the connection is not automatically reopened. The retry logic now retries only if the socket timed out (in the socket.timeout sense of “timed out”) and the flag retry_on_timeout is true.
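In other words, the retry now only covers timeouts, and only when the client opts in. A short sketch with illustrative parameters:

# With the corrected retry logic, execute_command() retries only when
# the error is a TimeoutError and the client was created with
# retry_on_timeout=True; a connection the server closed (ConnectionError,
# broken pipe) is not retried.
import redis

client = redis.StrictRedis(
    unix_socket_path='/tmp/celery.redis.test.sock',
    retry_on_timeout=True,
)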

Although the new retry logic (by default) won’t cause redis-py to automatically reconnect when Redis unilaterally closes a connection, other parts of redis-py do automatically resurrect closed connections. The get_connection method of the default ConnectionPool contains logic that reconnects disconnected connections that are still in the pool. These automatic reconnections make it difficult to diagnose the problems that Redis disconnections cause in Kombu or Celery: whether the disconnection is hidden by redis-py depends on timing, which explains why code that worked fine for the longest time suddenly throws exceptions on EPIPE.

0 reactions
mlissner commented, Jun 20, 2019

Thanks for the explainer, @lddubeau. Did you ever figure out a solution? I’ve got a script that calls:

task.get()

And it almost always fails. I’m pretty stumped on how to make that do an automatic retry (assuming that’s the right thing to make it do).
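One possible stopgap, assuming the failure surfaces as redis.exceptions.ConnectionError (as in the traceback above), is to retry the get() a bounded number of times. This is only a sketch, not a fix for the underlying issue; it reuses the tasks.py from the reproduction above, and the helper name and retry count are illustrative.

# Retry AsyncResult.get() when the result-backend connection was closed.
from redis.exceptions import ConnectionError

from tasks import add

def get_with_retry(async_result, attempts=3, **kwargs):
    for attempt in range(attempts):
        try:
            return async_result.get(**kwargs)
        except ConnectionError:
            if attempt == attempts - 1:
                raise

result = add.delay(1, 2)
print(get_with_retry(result))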
