Uncaught exception in reader crashing application
Hi!
I have an issue where, if the nsqd process goes down or the network fails, my application crashes unconditionally whenever a Reader is open at the time.
What happens is that I’m listening to a certain topic and channel, and as part of this the Reader tries to write some data over the NSQDConnection. The write method at nsqdconnection.js:424 defers the actual writing by deferring the _flush function. By the time _flush executes, the connection is down, so the this.conn.write() call at nsqdconnection.js:434 fails with ERR_STREAM_WRITE_AFTER_END. Because this runs in a deferred context on the main event loop, the error cannot be caught by my code, so it bubbles all the way up and kills my application.
Expected behaviour: handle the error and emit the ‘error’ event on the Reader instance.
Actual behaviour: the exception bubbles through the stack and kills the application. It is impossible to trace which Reader caused the error, which makes recovery difficult.
Issue Analytics
- Created 4 years ago
- Comments: 13 (5 by maintainers)
Top GitHub Comments
@nekufa Your use case is almost identical to ours; we structured our first solution the same way and saw roughly the same failure frequency. The strange thing isn’t really that the connection is closed, since that can happen when clients drop out, but that the exception is not caught and is instead allowed to bubble all the way to the top and kill the application.
And even if we managed to catch it at the application level, the context would be lost and we were unable to determine which connection caused the problem and should be retried.
@nekufa We changed our application logic to use a single long-lived Reader instance, filtering and distributing the messages in application logic. Since then we have not seen this error in production. I am still fairly confident there is a bug or race condition here somewhere: when we had a short Reader life cycle this crash happened a couple of times per day, and I suspect the Reader turnover is now so low that, combined with the low frequency of the error, it might take months or years before it happens again.