Kombu may crash while running BRPOP on a connection that Redis decided to close
Versions
Kombu: 4.0.2
Celery: 4.0.2
redis-py: 2.10.5
Redis: 3.2.6
Python: 2.7.12
Steps to reproduce
Start a Redis server with timeout 1. I use the following config:
daemonize yes
pidfile ./redis.pid
port 0
unixsocket /tmp/celery.redis.test.sock
unixsocketperm 755
timeout 1
loglevel notice
logfile ./redis.log
databases 1
(I shove the unix socket into /tmp/ because there's a limit to how long a unix socket path can be.)
I put the Celery configuration in celeryconfig.py:
CELERY_BROKER_URL = 'redis+socket:///tmp/celery.redis.test.sock'
CELERY_RESULT_BACKEND = CELERY_BROKER_URL
The Celery app is in tasks.py:
import sys
import time
from celery import Celery, Task
app = Celery('tasks')
app.config_from_object('celeryconfig', namespace='CELERY')
@app.task
def add(x, y):
    return x + y
I have test.py:
from celery.bin.multi import MultiTool
from tasks import app, add
workers = [
    "A",
    "B"
]

while True:
    for worker in workers:
        retcode = MultiTool().execute_from_commandline(["multi", "start", "-A",
                                                        "tasks", worker])
        print "STARTED {0} WITH {1}".format(worker, retcode)
    print "PING AFTER START", app.control.inspect().ping()
    print add.delay(1, 2).get()
    print "PING AFTER TASK", app.control.inspect().ping()
    for worker in workers:
        retcode = MultiTool().execute_from_commandline(["multi", "stopwait", "-A",
                                                        "tasks", worker])
        print "STOPPED {0} WITH {1}".format(worker, retcode)
    print "PING AFTER STOP", app.control.inspect().ping()
Run python test.py.
Expected Results
Should run forever without a crash.
Actual Results
Crashes at the end of the first iteration:
PING AFTER STOP
Traceback (most recent call last):
File "test.py", line 24, in <module>
print "PING AFTER STOP", app.control.inspect().ping()
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/celery/app/control.py", line 113, in ping
return self._request('ping')
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/celery/app/control.py", line 81, in _request
timeout=self.timeout, reply=True,
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/celery/app/control.py", line 436, in broadcast
limit, callback, channel=channel,
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/pidbox.py", line 321, in _broadcast
channel=chan)
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/pidbox.py", line 360, in _collect
self.connection.drain_events(timeout=timeout)
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/transport/virtual/base.py", line 961, in drain_events
get(self._deliver, timeout=timeout)
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 352, in get
self._register_BRPOP(channel)
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 301, in _register_BRPOP
channel._brpop_start()
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/kombu/transport/redis.py", line 707, in _brpop_start
self.client.connection.send_command('BRPOP', *keys)
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/redis/connection.py", line 563, in send_command
self.send_packed_command(self.pack_command(*args))
File "/home/ldd/src/celery_issues/celery_issue_2/.venv/local/lib/python2.7/site-packages/redis/connection.py", line 556, in send_packed_command
(errno, errmsg))
redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.
Observations
The code above in test.py imitates a sequence of operations that happen while testing an actual Django application of mine. The test suite starts and stops workers and executes some tasks on them.
Though the problem surfaced after I upgraded to Celery 4.x and thus started using Kombu 4.x, a cursory inspection of Kombu's source code in the 3.x series suggests the problem is present there too. It is unclear to me why I did not run into it when running Kombu 3.x.
In my actual Redis setup I do not use a timeout 1 setting. Using timeout 1 is simply an easy way to force Redis to close a connection. The same thing can happen for other reasons: the tcp-keepalive mechanism may deem a connection dead, a client may violate an output buffer limit, and so on. What is clear is that Redis clients should be resilient in the face of connections that were closed by the server.
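The server-side settings involved can be inspected or changed at runtime with redis-py itself. This is only a small aside, assuming the unix-socket setup above; config_get and config_set are standard redis-py client methods:

import redis

client = redis.StrictRedis(unix_socket_path='/tmp/celery.redis.test.sock')
print client.config_get('timeout')        # {'timeout': '1'}: idle clients are dropped after 1s
print client.config_get('tcp-keepalive')  # TCP keepalive interval used to detect dead peers
client.config_set('timeout', 1)           # equivalent to the "timeout 1" line in the config file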
Indeed, the code of redis-py will generally retry sending commands that fail. This can be seen in execute_command:
def execute_command(self, *args, **options):
    "Execute a command and return a parsed response"
    pool = self.connection_pool
    command_name = args[0]
    connection = pool.get_connection(command_name, **options)
    try:
        connection.send_command(*args)
        return self.parse_response(connection, command_name, **options)
    except (ConnectionError, TimeoutError) as e:
        connection.disconnect()
        if not connection.retry_on_timeout and isinstance(e, TimeoutError):
            raise
        connection.send_command(*args)
        return self.parse_response(connection, command_name, **options)
    finally:
        pool.release(connection)
Kombu's Redis Channel generally benefits from redis-py's automatic retrying because, most of the time, the methods it calls on its client ultimately run execute_command. However, _brpop_start calls self.client.connection.send_command('BRPOP', *keys) directly. If this call fails, the failure propagates up the stack without a retry.
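The difference is easy to see outside of Kombu. The following is only a minimal sketch, assuming redis-py 2.10.5 (whose execute_command is quoted above) and the Redis server configured earlier with timeout 1 on /tmp/celery.redis.test.sock:

import time

import redis

client = redis.StrictRedis(unix_socket_path='/tmp/celery.redis.test.sock')
client.ping()        # Opens the connection.
time.sleep(2)        # Redis closes the idle connection (timeout 1).

# Going through execute_command: the ConnectionError is caught, the stale
# socket is dropped, and the command is retried on a fresh connection.
print client.ping()  # True

time.sleep(2)        # Let Redis close the connection again.

# Calling Connection.send_command directly, as _brpop_start does, gets no
# such retry: the broken pipe propagates to the caller.
conn = client.connection_pool.get_connection('PING')
try:
    conn.send_command('PING')
    print conn.read_response()
except redis.exceptions.ConnectionError as exc:
    print "send_command failed:", exc  # Error 32 ... Broken pipe.
finally:
    client.connection_pool.release(conn)

In the second half, the connection popped from the pool still holds a socket that the server has already closed, which is exactly the state the traceback above shows.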
I've been able to work around the issue by changing the code of _brpop_start to:
def _brpop_start(self, timeout=1):
    queues = self._queue_cycle.consume(len(self.active_queues))
    if not queues:
        return
    keys = [self._q_for_pri(queue, pri) for pri in self.priority_steps
            for queue in queues] + [timeout or 0]
    self._in_poll = self.client.connection
    from redis.exceptions import ConnectionError
    try:
        self.client.connection.send_command('BRPOP', *keys)
    except ConnectionError:
        # redis-py drops the stale socket before raising ConnectionError,
        # so this second attempt opens a fresh connection.
        self.client.connection.send_command('BRPOP', *keys)
Top GitHub Comments
I revisited this issue recently and spent quite a bit of time re-examining the problem. The BRPOP failure I experienced is really one manifestation of a larger problem: Redis may close connections unilaterally. redis-py sometimes reopens the connection automatically. On those occasions when redis-py does not reopen the connection, a subsequent attempt at using the connection causes EPIPE, which causes an exception to be reported up the call stack.

My initial issue was with BRPOP, but the problem can happen whenever Celery (through Kombu) tries to contact Redis. My recent brush with this problem actually happened while Celery (in Celery code proper, not through Kombu) was trying to subscribe to a pubsub channel on Redis: the connection had been closed and the subscription failed. Retrying the subscription after the failure worked just fine.

When I filed my initial report, redis-py was coded in a way that made execute_command almost always retry failed attempts at sending a command to Redis. It turns out that the retry logic in redis-py was wrong and was only fixed earlier this year. When I ran my tests with the code of redis-py as it existed when I filed this issue, I could see retries happening regularly. Prior to the fix, if you used redis-py in its default configuration and the retry code faced a closed connection, it would always automatically reopen it. Now that the retry logic in redis-py has been fixed, if you use redis-py in its default configuration and the retry code faces a closed connection, the connection is not automatically reopened. The retry logic is now such that it retries only if the socket timed out (in the socket.timeout sense of "timed out") and the flag retry_on_timeout is true.

Although the new retry logic (by default) won't cause redis-py to automatically reconnect when Redis unilaterally closes a connection, there are other parts of redis-py that automatically resurrect closed connections. The get_connection method provided with the default ConnectionPool contains logic that reconnects disconnected connections that are still in the pool. These automatic reconnections make it difficult to diagnose problems that Redis disconnections cause with Kombu or Celery. Whether the disconnection is hidden by redis-py or not depends on timing, which explains why code that worked fine for the longest time suddenly throws exceptions on EPIPE.

Thanks for the explainer, @lddubeau. Did you ever figure out a solution? I've got a script that calls:

And it almost always fails. I'm pretty stumped on how to make that do an automatic retry (assuming that's the right thing to make it do).
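One pattern consistent with the analysis above is to retry the failing call once on ConnectionError. This is only a sketch, relying on the fact that redis-py disconnects the stale socket before raising (as its send_packed_command does in the versions discussed here), so the second attempt opens a fresh connection; it is only appropriate for idempotent commands:

import redis
from redis.exceptions import ConnectionError as RedisConnectionError

def call_with_retry(operation):
    """Run ``operation``; if the server closed the socket, run it once more."""
    try:
        return operation()
    except RedisConnectionError:
        # redis-py has already dropped the dead socket at this point, so the
        # second attempt reconnects and reissues the command.
        return operation()

client = redis.StrictRedis(unix_socket_path='/tmp/celery.redis.test.sock')
print call_with_retry(lambda: client.ping())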