ChannelFull exception crashing workers during backlog
We recently encountered ChannelFull worker crashes while clearing roughly 15 minutes' worth of backed-up messages (after the runworker processes had been temporarily offline).
At the time there were approximately 20,000 keys in the relevant Redis database, and we were using the default channel capacity of 100.
After some research, the suggested solution for clearing a backlog appears to be increasing the number of workers, so we did. The new workers proceeded to crash with the stack trace included below and didn't help process messages any faster.
We found ourselves in a situation where the only way to get things operating correctly was to pull the plug on new incoming WebSocket connections. This is because the ChannelFull error was crashing workers, which in turn meant the channels weren't actually being cleared (leading to more crashes, and so on).
At the time we had 32 worker processes across a number of machines attempting to catch up.
Is it expected behaviour for the workers to crash like this, and how could we mitigate similar problems in the future?
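For reference, our understanding is that the per-channel capacity can be raised on the channel layer itself rather than only adding workers. A rough sketch of the settings we would try is below; the hostname, routing path, and capacity numbers are placeholders, and the plain Redis backend is shown for brevity (the sentinel layer appears to accept the same capacity options):

# settings.py (sketch) -- raising per-channel capacity on the channel layer.
# Hostname, routing path, and capacity values below are placeholders.
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "asgi_redis.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("redis.example.internal", 6379)],
            # The default per-channel capacity is 100; raise it globally...
            "capacity": 500,
            # ...and/or per channel-name glob for the channels that back up most.
            "channel_capacity": {
                "websocket.send!*": 1000,
                "websocket.receive": 500,
            },
        },
        "ROUTING": "myproject.routing.channel_routing",
    },
}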
Setup
- Nginx proxying to upstream daphne running in containers
- Using channels to service only WebSocket requests (a routing sketch follows this list)
- asgi_redis.RedisSentinelChannelLayer backend
- Running runworker via supervisor on a number of machines
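To illustrate the WebSocket-only routing, here is a simplified sketch in the channels 1.x style; the consumer names and the echo behaviour are illustrative placeholders, not our actual project code:

# routing.py (sketch) -- WebSocket-only routing in the channels 1.x style.
# Consumer names and the echo behaviour are illustrative placeholders.
from channels.routing import route


def ws_connect(message):
    # Accept the WebSocket handshake.
    message.reply_channel.send({"accept": True})


def ws_receive(message):
    # Placeholder behaviour: echo incoming text frames back to the client.
    message.reply_channel.send({"text": message.content.get("text", "")})


def ws_disconnect(message):
    # Nothing to clean up in this sketch.
    pass


channel_routing = [
    route("websocket.connect", ws_connect),
    route("websocket.receive", ws_receive),
    route("websocket.disconnect", ws_disconnect),
]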
Versions
asgi-redis==1.3.0
channels==1.1.3
daphne==1.2.0
django==1.11.1
Twisted==17.1.0
Traceback
Traceback (most recent call last):
  File "/home/team/releases/current/manage.py", line 9, in <module>
    execute_from_command_line(sys.argv)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 363, in execute_from_command_line
    utility.execute()
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 355, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/channels/management/commands/runworker.py", line 83, in handle
    worker.run()
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/channels/worker.py", line 151, in run
    consumer_finished.send(sender=self.__class__)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 193, in send
    for receiver in self._live_receivers(sender)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/channels/message.py", line 93, in send_and_flush
    sender.send(message, immediately=True)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/channels/channel.py", line 44, in send
    self.channel_layer.send(self.name, content)
  File "/home/team/releases/current/virtualenv/local/lib/python2.7/site-packages/asgi_redis/core.py", line 177, in send
    raise self.ChannelFull
asgiref.base_layer.ChannelFull
Comments
Ah yes, I see what's happening: the atomic message handling is not correctly dealing with ChannelFull. I'll work on a fix for it soon.
This issue started affecting me in production and development too; thanks for the super quick fix.
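For anyone who hits this before picking up the fix, one possible stop-gap (not the upstream fix itself) is to catch ChannelFull around application-level sends and retry with a short backoff instead of letting it propagate. A minimal sketch, assuming the channels 1.x APIs; the helper name, channel name, payload, and retry numbers are placeholders:

# Stop-gap sketch (not the upstream fix): retry an application-level send when
# the target channel is momentarily full instead of letting ChannelFull crash
# the process. Helper name, channel name, payload, and numbers are placeholders.
import time

from channels import Channel


def send_with_retry(channel_name, payload, attempts=5, delay=0.5):
    channel = Channel(channel_name)
    for _ in range(attempts):
        try:
            # immediately=True sends right away, so ChannelFull surfaces here
            # rather than later in the deferred consumer_finished flush.
            channel.send(payload, immediately=True)
            return True
        except channel.channel_layer.ChannelFull:
            time.sleep(delay)  # back off briefly, then try again
    return False


# Example: best-effort push to a client's reply channel (placeholder name).
send_with_retry("websocket.send!placeholder", {"text": "hello"})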