[Streaming-Core] frequent websocket disconnects
Hi all,
I have spent some time tracking (and attempting to improve) websocket connection reliability in XChange and wanted to detail what I have so far for anyone who may be experiencing the same thing (and anyone who may have time to also work on improving this).
TL;DR: I am seeing websocket disconnects anywhere from every hour to every 3 hours, up to about every 12 hours at best. This is a lot; a conversation with Kroitor @ CCXT made me aware that they can go several days without experiencing a disconnect (for CoinbasePro/Kraken). I suspect these disconnects are self-inflicted.
We have an idle read timeout (how long we go without reading anything over the websocket) which triggers a disconnect if we don't see anything for 15 seconds. I added https://github.com/knowm/XChange/commit/3010143ce0d487756943d6178aefbe865ff91ef4 to send a ping, because channels that weren't busy would disconnect every few minutes.
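For anyone unfamiliar with how this kind of thing is wired up in Netty, here is a minimal sketch of the read-idle-timeout-plus-ping idea. It is only an illustration of the technique, not the actual code from that commit or from `NettyStreamingService`; the class name, pipeline wiring, and the 5-second write-idle value are made up for the example.

```java
import java.util.concurrent.TimeUnit;

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPipeline;
import io.netty.handler.codec.http.websocketx.PingWebSocketFrame;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

// Illustrative handler: close the channel if we read nothing for the idle
// timeout, and send a ping when we haven't written anything for a while so
// quiet channels still produce traffic (a pong) to read.
public class IdleTimeoutWithPingHandler extends ChannelDuplexHandler {

  @Override
  public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    if (evt instanceof IdleStateEvent) {
      IdleStateEvent idle = (IdleStateEvent) evt;
      if (idle.state() == IdleState.READER_IDLE) {
        // Nothing read within the timeout: assume the channel is dead and close it.
        ctx.close();
      } else if (idle.state() == IdleState.WRITER_IDLE) {
        // Nothing written recently: ping the server to keep a quiet channel alive.
        ctx.writeAndFlush(new PingWebSocketFrame());
      }
    } else {
      super.userEventTriggered(ctx, evt);
    }
  }

  // Hypothetical wiring: 15s read-idle check, 5s write-idle check, no all-idle check.
  public static void addTo(ChannelPipeline pipeline) {
    pipeline.addLast("idleStateHandler", new IdleStateHandler(15, 5, 0, TimeUnit.SECONDS));
    pipeline.addLast("idleTimeoutWithPing", new IdleTimeoutWithPingHandler());
  }
}
```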
The ping improved the disconnect rate for me from every few minutes to every few hours, but that is still not what it should be. True websocket disconnects tend to be tricky to detect correctly: if a server process dies without notifying the client via a shutdown/disconnect hook, the client has to figure out on its own whether the channel is dead, and that is somewhat empirical. Things like a long GC pause on the server side can make the client think the channel is dead when it actually isn't, so read timeouts are a best-effort heuristic.
In any case, of the disconnects I see, it is only very rarely the server sending a close frame, which would print this log line: `LOG.info("WebSocket Client received closing! {}", ctx.channel());`. Most disconnects just appear to be an inactive channel.
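For context, that log line comes from the frame handler spotting a server-initiated `CloseWebSocketFrame`. The sketch below is a standalone illustration of the distinction I mean (it is not XChange's actual handler): a clean close produces the close frame and the log line, while the disconnects I'm chasing skip that and go straight to `channelInactive`.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.handler.codec.http.websocketx.CloseWebSocketFrame;
import io.netty.handler.codec.http.websocketx.WebSocketFrame;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative frame handler: distinguishes a server-initiated close frame
// (the rare, clean case) from a channel that simply goes inactive without
// any close handshake (the common case in these disconnects).
public class CloseFrameObserver extends SimpleChannelInboundHandler<WebSocketFrame> {

  private static final Logger LOG = LoggerFactory.getLogger(CloseFrameObserver.class);

  @Override
  protected void channelRead0(ChannelHandlerContext ctx, WebSocketFrame frame) {
    if (frame instanceof CloseWebSocketFrame) {
      // Server told us it is closing: this is the rare, clean case.
      LOG.info("WebSocket Client received closing! {}", ctx.channel());
      ctx.close();
    } else {
      // Retain and pass everything else further down the pipeline.
      ctx.fireChannelRead(frame.retain());
    }
  }

  @Override
  public void channelInactive(ChannelHandlerContext ctx) {
    // The common case: no close frame, the channel just dies.
    LOG.warn("Channel became inactive without a close frame: {}", ctx.channel());
    ctx.fireChannelInactive();
  }
}
```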
At the current stage I have modified the `IdleStateHandler` with additional logging to see what the stack trace looks like before the channel's inactive methods are called; I have attached the resulting stack_trace.log. What is logged as "read message" is basically us reading a message over the websocket channel; you can see there is an unusual gap before we start triggering the idle code, but this gap is much smaller than the idle timeout. `TempIdleStateHandler` is what I've replaced `IdleStateHandler` with in the `NettyStreamingService` class.
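The actual `TempIdleStateHandler` is attached to the issue; roughly, the idea is a subclass of `IdleStateHandler` that logs every read and records a stack trace whenever the idle/inactive paths fire, so the gap between the last read and the idle trigger can be measured. A hypothetical reconstruction of that idea (the constructor, log messages, and levels here are mine, not the attached code):

```java
import java.util.concurrent.TimeUnit;

import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative logging variant of IdleStateHandler: same idle detection,
// plus a timestamped log for every read and a stack trace whenever the idle
// or inactive paths run.
public class TempIdleStateHandler extends IdleStateHandler {

  private static final Logger LOG = LoggerFactory.getLogger(TempIdleStateHandler.class);

  public TempIdleStateHandler(long readerIdleSeconds) {
    super(readerIdleSeconds, 0, 0, TimeUnit.SECONDS);
  }

  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
    // "read message": a message arrived over the websocket channel.
    LOG.debug("read message at {} on {}", System.currentTimeMillis(), ctx.channel());
    super.channelRead(ctx, msg);
  }

  @Override
  protected void channelIdle(ChannelHandlerContext ctx, IdleStateEvent evt) throws Exception {
    // Capture where the idle trigger is coming from.
    LOG.warn("idle event {} on {}", evt.state(), ctx.channel(), new Exception("idle trigger stack trace"));
    super.channelIdle(ctx, evt);
  }

  @Override
  public void channelInactive(ChannelHandlerContext ctx) throws Exception {
    LOG.warn("channel went inactive: {}", ctx.channel(), new Exception("inactive stack trace"));
    super.channelInactive(ctx);
  }
}
```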
I still have not figured out what the source of this issue is but hopefully this can be useful to anyone who has noticed this.
Top GitHub Comments
Another update: after opening an issue with the Netty folks, it seems like this is getting triggered because the websocket is receiving an EOF: https://github.com/netty/netty/issues/10830.
Small update: I have tracked down what is triggering the channel closing in the Netty code. In `AbstractNioByteChannel.java` the read loop gets flagged to close, which triggers `closeOnRead(pipeline)`. It appears the websocket is entering a half-open state, but I will update when I have more.
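If anyone wants to observe this themselves: by default `closeOnRead(pipeline)` closes the channel as soon as the read side sees EOF, but if half-closure is allowed on the channel, Netty instead fires a `ChannelInputShutdownEvent`, which makes the EOF visible in a handler. A sketch of that (purely for diagnosis, not a proposed fix; the class name and the assumption that you can set options on the client `Bootstrap` are mine):

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelOption;
import io.netty.channel.socket.ChannelInputShutdownEvent;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative diagnostic: with ALLOW_HALF_CLOSURE enabled, an EOF on the read
// side surfaces as a ChannelInputShutdownEvent instead of an immediate close,
// so it can be logged before the channel goes away.
public class HalfClosureLogger extends ChannelInboundHandlerAdapter {

  private static final Logger LOG = LoggerFactory.getLogger(HalfClosureLogger.class);

  // Assumption: the streaming client exposes its Bootstrap so options can be set.
  public static void configure(Bootstrap bootstrap) {
    bootstrap.option(ChannelOption.ALLOW_HALF_CLOSURE, true);
  }

  @Override
  public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
    if (evt instanceof ChannelInputShutdownEvent) {
      // The remote end (or something in between) shut down the read side: EOF received.
      LOG.warn("Input side of {} was shut down (EOF received)", ctx.channel());
    }
    super.userEventTriggered(ctx, evt);
  }
}
```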