Subscription topic stalls after some time.
Is there an existing issue for this?
- I have searched the existing issues
Product
Hot Chocolate
Describe the bug
A subscription topic randomly stops working after some time. The client is still subscribed and ping/pongs are still being sent, and new messages still arrive for other topics, just not for the failed one. It’s not limited to a specific topic: sometimes topicA fails, another time it’s topicB, but eventually they all fail.
I have made a minimal solution that reproduces the issue; you can find it in this repo: https://github.com/DownGoat/HotChocolate13SubIssue
It has a single query with properties that take some time to resolve (simulating dataloaders and slow DB calls), and a single subscription that returns some random data. This is very similar to the setup in our project, where a controller endpoint receives data every 5 seconds and pushes it to the topic that usually stalls first. That topic has the most subscribers and is the only one that frequently sends new messages.
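For reference, here is a minimal sketch of the kind of setup described above. The type, field, and topic names are my own assumptions modeled on the operations in the steps below, not the exact code from the linked repo, and it assumes a .NET 6+ web project with implicit usings:

```csharp
using HotChocolate;
using HotChocolate.Types;

// Program.cs: minimal wiring with the same shape as the repro solution.
var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddGraphQLServer()
    .AddQueryType<Query>()
    .AddSubscriptionType<Subscription>()
    .AddInMemorySubscriptions();

var app = builder.Build();
app.UseWebSockets();   // subscriptions run over WebSockets
app.MapGraphQL();
app.Run();

public record VesselPosition(DateTime TimeStamp, int Int1, int Int2, int Int3);
public record SlowEntity(string Prop1, string Prop2, string Prop3, string Prop4);

public class Query
{
    // Simulates the slow resolvers (dataloaders / slow DB calls) from the repro.
    public async Task<SlowEntity> GetSlowEntity()
    {
        await Task.Delay(TimeSpan.FromSeconds(5));
        return new SlowEntity("a", "b", "c", "d");
    }
}

public class Subscription
{
    // One named topic; every published batch is forwarded to subscribers as-is.
    [Subscribe]
    [Topic("VesselPositions")]
    public IEnumerable<VesselPosition> ListVessels(
        [EventMessage] IEnumerable<VesselPosition> positions) => positions;
}
```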
I have tested the same solution with version 12.17.0, and I have not managed to reproduce the issue. We first noticed it after upgrading to version 13.
Steps to reproduce
- Start the following subscription:
```graphql
subscription VesselPositions {
  listVessels {
    timeStamp
    int1
    int2
    int3
  }
}
```
- Send data to the topic by sending a GET request to https://localhost:5001/WeatherForecast. This endpoint sends a list of three entities with random data for the intN fields and the current time. I use Postman to repeat this request forever with a 10 ms delay between requests (a hedged sketch of such an endpoint is shown after these steps).
- Start and stop the subscription in BCP; after some tries the subscription stalls and you won’t get any new data. If this takes a long time, try opening a new tab in BCP and running the following query, and keep starting and stopping the subscription while it resolves:
```graphql
query WatchingPaintDry {
  slowEntity {
    prop1
    prop2
    prop3
    prop4
  }
}
```
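For completeness, here is a hedged sketch of what the publishing endpoint in step 2 could look like using ITopicEventSender. The controller name is taken from the URL above, while the topic name and payload shape are assumptions matching the subscription sketch earlier; the controller also needs the usual AddControllers()/MapControllers() wiring:

```csharp
using HotChocolate.Subscriptions;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("[controller]")]
public class WeatherForecastController : ControllerBase
{
    private readonly ITopicEventSender _sender;

    public WeatherForecastController(ITopicEventSender sender) => _sender = sender;

    // GET https://localhost:5001/WeatherForecast publishes one batch of three
    // entities with random intN values and the current time to the topic.
    [HttpGet]
    public async Task<IActionResult> Get(CancellationToken cancellationToken)
    {
        var batch = Enumerable.Range(0, 3)
            .Select(_ => new VesselPosition(
                DateTime.UtcNow,
                Random.Shared.Next(),
                Random.Shared.Next(),
                Random.Shared.Next()))
            .ToList();

        await _sender.SendAsync<IEnumerable<VesselPosition>>(
            "VesselPositions", batch, cancellationToken);

        return Ok();
    }
}
```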
Relevant log output
No response
Additional Context?
No response
Version
13.x.x
Top GitHub Comments
We are having the same problem. After spending quite a lot of time debugging this in the HotChocolate libraries, we managed to narrow down the issue to the HotChocolate.Subscriptions.TopicShard<T> class.
TL;DR: Closed subscriptions are not cleaned up correctly in HC (we think), and there is a major bug that in theory should impact anyone using subscriptions.
As far as we can tell: a new channel is added to the _outgoing list in the topic shard when a new subscription is created. When the subscription is removed (clicking the stop/cancel button in BCP), the socket is stopped in the browser, but the channel is never removed from the list mentioned above in Hot Chocolate.
When reproducing we see the following: the topic itself has an outbound channel with a default buffer of 64 messages. Each subscription gets its own channel as an outgoing buffer. In order for the topic to “complete” an outgoing message, each subscriber needs to receive and acknowledge its copy. But because closing a subscription never completes/removes the associated channel, the _outgoing list in the TopicShard grows by one and is never reduced again.
This in turn blocks the topic’s incoming queue, which reaches its capacity of 64 items; messages are then never processed, since the writes to the “dead” subscription’s queue are still pending.
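To illustrate the mechanism with plain System.Threading.Channels (a simplified model, not Hot Chocolate’s actual TopicShard code): a dispatcher that must copy every message into each subscriber’s bounded channel stalls as soon as one abandoned channel fills up, and the healthy subscribers stop receiving messages with it.

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

// One bounded channel per subscriber, mirroring the per-subscription outgoing buffer.
static Channel<int> CreateSubscriberChannel() =>
    Channel.CreateBounded<int>(new BoundedChannelOptions(64)
    {
        FullMode = BoundedChannelFullMode.Wait // writer waits while the buffer is full
    });

var live = CreateSubscriberChannel();
var dead = CreateSubscriberChannel(); // a stopped subscription whose channel was never removed

// The live subscriber keeps draining its channel.
_ = Task.Run(async () =>
{
    await foreach (var _ in live.Reader.ReadAllAsync()) { /* deliver over the socket */ }
});

// Dispatcher: every message has to be accepted by *all* subscriber channels.
for (var i = 0; ; i++)
{
    await live.Writer.WriteAsync(i);
    try
    {
        // After 64 messages the dead channel is full; with FullMode.Wait this write
        // would never complete, so the live subscriber stops getting new data too.
        await dead.Writer.WriteAsync(i).AsTask().WaitAsync(TimeSpan.FromSeconds(1));
    }
    catch (TimeoutException)
    {
        Console.WriteLine($"Dispatcher stalled at message {i}: the abandoned buffer is full.");
        break;
    }
}
```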
We have reproduced this with BOTH the graphql-ws and graphql-transport-ws protocols, as well as with both the in-memory and Redis subscription providers.
We have also confirmed that this issue is per topic. Create and stop a subscription, then publish 64 messages to the topic, and the topic will be stalled indefinitely.
It also seems like one subscription can stall the entire topic’s processing if it never acknowledges its messages.
The channel removal / cleanup issue is also very likely to be the source of this redis specific issue: https://github.com/ChilliCream/graphql-platform/issues/5336
Michael posted this workaround for the in-memory provider in Slack, which seems to work for me: