ASGIRef 3.4.1 + Channels 3.0.3 causes non-deterministic 500 errors serving static files
So this was the very first scenario in which the Single thread executor error was found, and that led to me opening django/asgiref#275
While trying to get a simple repro-case for it, we figured out a way to trigger an error related to it in a very simple way and this was fixed with https://github.com/django/asgiref/releases/tag/3.4.1
But testing the new 3.4.1 version against our code-base still yielded the same 500 errors while serving static files (at least) in the dev environment.
I’ve updated https://github.com/rdmrocha/asgiref-thread-bug with this new repro-case, by loading a crapload of JS files (1500), but that can be changed in the views.py file.
It doesn’t ALWAYS happen (so you might need a hard-refresh or two) but when it does, you’ll be greeted with something like this:
I believe this is still related to https://github.com/django/asgiref/commit/13d0b82a505a753ef116e11b62a6dfcae6a80987 as reverting to v3.3.4 via requirements.txt makes the error go away.
Looking at the offending code inside channels/http.py, it looks like this might be a thread exhaustion issue, but this is pure speculation.
Since handle is decorated with sync_to_async, the send has to become sync, and we’re waiting on it like this: await self.handle(scope, async_to_sync(send), body_stream).
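To make the calling pattern concrete, here is a minimal stdlib-only sketch of it. It is not the channels or asgiref code: executor, send, handle, serve and sync_send are hypothetical stand-ins, with run_in_executor playing the role of the sync_to_async-decorated handle and run_coroutine_threadsafe playing the role of async_to_sync(send).

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: this mimics the calling pattern only and is
# not the asgiref or channels implementation.
executor = ThreadPoolExecutor(max_workers=1)


async def send(message, sent):
    # Async "send", as an ASGI server would provide it.
    sent.append(message)


def handle(sync_send, body):
    # Sync handler running in the worker thread; each sync_send call
    # blocks this thread until the event loop has run send().
    sync_send({"type": "http.response.start", "status": 200})
    sync_send({"type": "http.response.body", "body": body})


async def serve(body):
    loop = asyncio.get_running_loop()
    sent = []

    def sync_send(message):
        # Rough analogue of async_to_sync(send): schedule the coroutine
        # on the loop and wait for its result from the worker thread.
        asyncio.run_coroutine_threadsafe(send(message, sent), loop).result()

    # Rough analogue of awaiting a sync_to_async-decorated handle().
    await loop.run_in_executor(executor, handle, sync_send, body)
    return sent


messages = asyncio.run(serve(b"console.log('hi');"))
```

The point of the sketch is that the worker thread stays occupied for the whole request, including the time it spends waiting for each send to complete on the event loop.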
If there are no more threads available, I speculate that they might end up in a deadlock waiting for the unwrap of this await async_to_sync(async_to_sync) call, eventually triggering the protection introduced in https://github.com/django/asgiref/commit/13d0b82a505a753ef116e11b62a6dfcae6a80987
But take this last part with a grain of salt as this is pure speculation without diving into the code and debugging it. Hope it helps
Issue Analytics
- Created 2 years ago
- Reactions:9
- Comments:34 (8 by maintainers)
Having run into this issue and dug into its root cause, I think I can provide some insight. As I understand it, the deadlock detection in asgiref works like this:
The issue here is that contexts may be re-used by daphne / twisted in the case of persistent connections. When a second HTTP request is sent on the same TCP connection, twisted re-uses the same context from the existing connection instead of creating a new one.
So in twisted, context variables are per connection, not per HTTP request. This subtle difference then causes a problem due to how the StaticFilesHandler, built on the channels AsgiHandler, works. It uses async_to_sync(send) to pass the send method into self.handle(), which is itself decorated with sync_to_async.
So what I think is happening is this sequence of events:
- [...] send() (but the sync thread does not yet exit!)
- [...] ASGIHandler and SyncToAsync try to use the single thread executor but it’s busy, so it errors
If step 6 blocked instead of erroring, all would be fine, since the sync thread would have finished anyway. I don’t think there’s a deadlock here, and I don’t think the deadlock detection code in asgiref is working properly.
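The failure mode described above can be modelled with nothing but contextvars. This is a simplified sketch under my own assumptions, not asgiref's actual detection code: executor_busy, start_request, finish_request and two_requests are hypothetical names, and the "busy" flag stands in for whatever state the deadlock detection keeps in the context.

```python
import contextvars

# Hypothetical names throughout; this is a simplified model of the
# described failure, not asgiref's actual code.
executor_busy = contextvars.ContextVar("executor_busy", default=False)


def start_request():
    # Refuse to start if the flag in the current context is still set.
    if executor_busy.get():
        raise RuntimeError("single thread executor already in use")
    executor_busy.set(True)


def finish_request():
    executor_busy.set(False)


def two_requests(ctx_first, ctx_second, finish_first):
    # Run two requests, each in the given context, and report the
    # outcome of the second one.
    ctx_first.run(start_request)
    if finish_first:
        ctx_first.run(finish_request)
    try:
        ctx_second.run(start_request)
        return "ok"
    except RuntimeError:
        return "error"


# Context per request: the second request never sees the first one's flag.
per_request = two_requests(
    contextvars.copy_context(), contextvars.copy_context(), False
)

# Context per connection: the flag leaks into the second request when the
# first has sent its response but has not yet cleared the flag.
shared = contextvars.copy_context()
per_connection = two_requests(shared, shared, False)

# Same shared context, but the first request fully finished beforehand.
shared_done = contextvars.copy_context()
per_connection_finished = two_requests(shared_done, shared_done, True)
```

In this model only the shared-context case with an unfinished first request fails, which matches the intermittent nature of the errors: whether the second request errors depends on whether the first sync thread has exited by the time it arrives.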
OK, I’ve started work on what will be v4.0. #1890 moves to use django.contrib.staticfiles, and should address this.
If anyone wants to give that a run, or follow main over the next few weeks to help spot any issues, that would be great.
Once I’ve made a bit more progress, I’ll open a tracking issue for v4.0 as well.