Default Implicits from cats.effect.IOApp cause BlazeServer to crash when a JVM error occurs
When the default ContextShift and Timer from cats.effect.IOApp are passed to BlazeServerBuilder, and a java.lang.Error is raised during the processing of one of the HTTP requests, the error bubbles up to the main thread running the server and kills the app. This behaviour does not seem to happen if the user defines their own ContextShift and Timer.
Example
Suppose there is an HTTP service that throws a StackOverflowError for some input data:
import cats.effect.IO
import cats.syntax.either._

trait HelloWorld {
  def hello(n: HelloWorld.Name): IO[HelloWorld.Greeting]
}

object HelloWorld {
  // Minimal definitions for the types referenced above.
  final case class Name(name: String)
  final case class Greeting(greeting: String)

  def impl: HelloWorld = new HelloWorld {
    def hello(n: HelloWorld.Name): IO[HelloWorld.Greeting] =
      simulateError

    // Either.catchNonFatal only catches NonFatal throwables, so the
    // StackOverflowError is rethrown here instead of being captured
    // in the Either.
    def simulateError: IO[HelloWorld.Greeting] =
      IO.fromEither {
        Either.catchNonFatal {
          throw new StackOverflowError("BOOM BANG!!")
        }
      }
  }
}
If the user makes a request to the endpoint bound to this implementation:
curl http://localhost:8080/hello/john
Then the Error bubbles up to the main thread and kills the app. However, if we override the default implicits in the app, the curl request simply times out and the app keeps running, which is the desired behaviour:
import cats.effect.{ContextShift, IO, Timer}
import scala.concurrent.ExecutionContext

override implicit val contextShift: ContextShift[IO] =
  IO.contextShift(ExecutionContext.global)
override implicit val timer: Timer[IO] =
  IO.timer(ExecutionContext.global)
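For context, here is a minimal sketch of a complete Main wiring these overrides into BlazeServerBuilder. It assumes http4s 0.21.x on Cats Effect 2, and the route wiring is illustrative rather than taken from the issue; removing the two overrides reproduces the crash described above.

import cats.effect.{ContextShift, ExitCode, IO, IOApp, Timer}
import org.http4s.HttpRoutes
import org.http4s.dsl.io._
import org.http4s.implicits._
import org.http4s.server.blaze.BlazeServerBuilder
import scala.concurrent.ExecutionContext

object Main extends IOApp {
  // User-defined instances replacing IOApp's defaults, as above.
  override implicit val contextShift: ContextShift[IO] =
    IO.contextShift(ExecutionContext.global)
  override implicit val timer: Timer[IO] =
    IO.timer(ExecutionContext.global)

  // Illustrative route delegating to the HelloWorld service defined earlier.
  private val routes = HttpRoutes.of[IO] {
    case GET -> Root / "hello" / name =>
      HelloWorld.impl.hello(HelloWorld.Name(name)).flatMap(g => Ok(g.greeting))
  }

  def run(args: List[String]): IO[ExitCode] =
    BlazeServerBuilder[IO](ExecutionContext.global)
      .bindHttp(8080, "localhost")
      .withHttpApp(routes.orNotFound)
      .serve
      .compile
      .drain
      .map(_ => ExitCode.Success)
}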
A sample application that simulates this error is available here.
Top GitHub Comments
Fatal errors, in general, can leave the JVM in an undefined state. StackOverflowError is probably a bad example because it is far and away the most benign, but many of the others (such as OOM) should not be caught except under very specific circumstances.

Story time: I once had a bug that only occurred in production, and only after prod had been live for a few days. What we were seeing ultimately narrowed down to floating-point arithmetic producing the wrong result. It was almost as simple as computing 2 + 2 and seeing an answer of 5. No objects, nothing fancy, just primitives that were somehow wrong. And again, this only happened after prod had been live for a while.

After an absolutely heroic minimization effort, one of my coworkers finally noticed that, GBs further up in the logs, a StackOverflowError had been printed, seemingly benignly. It didn't really make any sense, particularly since the stack trace didn't actually look like a loop, but it was the first real clue. Digging into it further, we realized the SOE was actually a consequence of something much more subtle: Future at the time was catching all Throwables, not just NonFatal ones, and it would attempt to wrap those Throwables just like any other. The real problem was that the application was running out of memory, and thus an OutOfMemoryError was being raised within Future, which promptly caught it and tried to allocate a failed Future to carry it, but that allocation failed, which tripped a second OutOfMemoryError (still within Future), etc., etc.

When all the dust settled, the OOMs had been completely eaten, and the error printed was a StackOverflowError, for reasons I can't quite remember. By that point the JVM was still running, and the only external consequence was that an HTTP request had timed out, but internally it was in deep trouble. Digging into the JVM spec on this type of situation, what was happening was ultimately that the JVM itself (I think specifically the GC) was in a completely undefined state, with its internal data structures no longer obeying invariants. This in turn was enough to generate the seemingly impossible and unrelated consequences later on down the line in that same instance, such as floating-point arithmetic behaving impossibly.

Moral of the story: be very, very, very careful around catching Throwable. It's fine for some "fatal" errors, like InterruptedException and StackOverflowError (though with the former, you should be sure to handle it appropriately), but in general it's not something you should be doing.
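As a concrete illustration of that boundary (a minimal sketch, not part of the original comment): scala.util.control.NonFatal is the standard extractor for drawing the fatal/non-fatal line.

import scala.util.control.NonFatal

object NonFatalDemo extends App {
  def classify(t: Throwable): String = t match {
    case NonFatal(_) => "non-fatal: safe to catch and handle"
    case _           => "fatal: rethrow or fail fast"
  }

  println(classify(new IllegalArgumentException("bad input"))) // non-fatal
  println(classify(new StackOverflowError("BOOM BANG!!")))     // fatal: VirtualMachineError
  println(classify(new InterruptedException("stop")))          // fatal: needs careful handling
}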
CE3 improves on this situation relative to CE2, by the way. The runtime will still exit on a fatal error, but it will propagate the exception back to the unsafeRun call site (generally within IOApp) without allocating or relying on safe points; IOApp then prints the exception, removes the application finalizers, and calls sys.exit(-1). So it still takes down the app, but at least the errors end up in the right place and will deterministically print.

Ultimately, fatal errors are a bit like segfaults: you simply cannot suppress them in general. My best advice, if you're honestly expecting SOEs from some effect in normal operation, would be to stick a manual try/catch around the root of the SOE, before it gets wrapped in IO, and suppress the exception manually.
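Following that last piece of advice, a hedged sketch of what such a manual guard might look like; the guarded helper and the RecursionDepthExceeded error type are hypothetical, not from the thread. The guard catches the StackOverflowError at its root (the one fatal error the comment above calls relatively benign) and converts it into an ordinary non-fatal error before the computation is lifted into IO.

import cats.effect.IO

object FatalGuard {
  // Hypothetical domain error standing in for the suppressed SOE.
  final case class RecursionDepthExceeded(message: String)
      extends RuntimeException(message)

  // Evaluate the thunk inside a manual try/catch so the fatal error
  // never escapes into the IO runtime.
  def guarded[A](thunk: => A): IO[A] =
    IO.suspend {
      try IO.pure(thunk)
      catch {
        case _: StackOverflowError =>
          IO.raiseError(RecursionDepthExceeded("stack depth exceeded"))
      }
    }
}

A failed IO produced this way lives in the ordinary error channel, so it can be handled with handleErrorWith, or surfaced by the server as a 500, instead of killing the process.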