Default Implicits from cats.effect.IOApp cause BlazeServer to crash when a JVM error occurs
When the default ContextShift and Timer from cats.effect.IOApp are passed to BlazeServerBuilder, and a java.lang.Error is raised during the processing of one of the HTTP requests, the error bubbles up to the main thread running the server and kills the app. This behaviour does not seem to happen if the user defines their own ContextShift and Timer.
Example
Suppose there is an HTTP service that throws a StackOverflowError for some input data:
import cats.effect.IO
import cats.syntax.either._

trait HelloWorld {
  def hello(n: HelloWorld.Name): IO[HelloWorld.Greeting]
}

object HelloWorld {
  // Minimal definitions for the types referenced above.
  final case class Name(name: String)
  final case class Greeting(greeting: String)

  def impl: HelloWorld = new HelloWorld {
    def hello(n: HelloWorld.Name): IO[HelloWorld.Greeting] =
      simulateError

    // Either.catchNonFatal only catches NonFatal throwables, so the
    // StackOverflowError is rethrown here instead of being captured
    // in the Either.
    def simulateError: IO[HelloWorld.Greeting] =
      IO.fromEither {
        Either.catchNonFatal {
          throw new StackOverflowError("BOOM BANG!!")
        }
      }
  }
}
If the user makes a request to the endpoint bound to this implementation:
curl http://localhost:8080/hello/john
Then the Error bubbles up to the main thread and kills the app. However, if we override the default implicits in the app, the curl request simply times out and the app keeps running, which is the desired behaviour:
import cats.effect.{ContextShift, IO, Timer}
import scala.concurrent.ExecutionContext

override implicit val contextShift: ContextShift[IO] =
  IO.contextShift(ExecutionContext.global)
override implicit val timer: Timer[IO] =
  IO.timer(ExecutionContext.global)
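For context, here is a minimal sketch of a complete Main wiring these overrides into BlazeServerBuilder. It assumes http4s 0.21.x on Cats Effect 2, and the route wiring is illustrative rather than taken from the issue; removing the two overrides reproduces the crash described above.

import cats.effect.{ContextShift, ExitCode, IO, IOApp, Timer}
import org.http4s.HttpRoutes
import org.http4s.dsl.io._
import org.http4s.implicits._
import org.http4s.server.blaze.BlazeServerBuilder
import scala.concurrent.ExecutionContext

object Main extends IOApp {
  // User-defined instances replacing IOApp's defaults, as above.
  override implicit val contextShift: ContextShift[IO] =
    IO.contextShift(ExecutionContext.global)
  override implicit val timer: Timer[IO] =
    IO.timer(ExecutionContext.global)

  // Illustrative route delegating to the HelloWorld service defined earlier.
  private val routes = HttpRoutes.of[IO] {
    case GET -> Root / "hello" / name =>
      HelloWorld.impl.hello(HelloWorld.Name(name)).flatMap(g => Ok(g.greeting))
  }

  def run(args: List[String]): IO[ExitCode] =
    BlazeServerBuilder[IO](ExecutionContext.global)
      .bindHttp(8080, "localhost")
      .withHttpApp(routes.orNotFound)
      .serve
      .compile
      .drain
      .map(_ => ExitCode.Success)
}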
A sample application that simulates this error is available here.
Top GitHub Comments
Fatal errors, in general, can leave the JVM in an undefined state. StackOverflowError is probably a bad example because it is far and away the most benign, but many of the others (such as OOM) should not be caught except under very specific circumstances.

Story time: I once had a bug that only occurred in production, and only after prod had been live for a few days. What we were seeing ultimately narrowed down to floating-point arithmetic producing the wrong result. It was almost as simple as computing 2 + 2 and seeing an answer of 5. No objects, nothing fancy, just primitives that were somehow wrong. And again, this only happened after prod had been live for a while.

After an absolutely heroic minimization effort, one of my coworkers finally noticed that, GBs further up in the logs, a StackOverflowError had been printed, seemingly benignly. It didn't really make any sense, particularly since the stack trace didn't actually look like a loop, but it was the first real clue. Digging into it further, we realized the SOE was actually a consequence of something much more subtle: Future at the time was catching all Throwables, not just NonFatal ones, and it would attempt to wrap those Throwables just like any other. The real problem was that the application was running out of memory, and thus an OutOfMemoryError was being raised within Future, which promptly caught it and tried to allocate a failed Future to carry it, but that allocation failed, which tripped a second OutOfMemoryError (still within Future), etc., etc.

When all the dust settled, the OOMs had been completely eaten, and the error printed was a StackOverflowError, for reasons I can't quite remember. By that point the JVM was still running, and the only external consequence was that an HTTP request had timed out, but internally it was in deep trouble. Digging into the JVM spec on this type of situation, what was happening was ultimately that the JVM itself (I think specifically the GC) was in a completely undefined state, with its internal data structures no longer obeying invariants. This in turn was enough to generate the seemingly impossible and unrelated consequences later on down the line in that same instance, such as floating-point arithmetic behaving impossibly.

Moral of the story: be very, very, very careful around catching Throwable. It's fine for some "fatal" errors, like InterruptedException and StackOverflowError (though with the former, you should be sure to handle it appropriately), but in general it's not something you should be doing.
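As a concrete illustration of that boundary (a minimal sketch, not part of the original comment): scala.util.control.NonFatal is the standard extractor for drawing the fatal/non-fatal line.

import scala.util.control.NonFatal

object NonFatalDemo extends App {
  def classify(t: Throwable): String = t match {
    case NonFatal(_) => "non-fatal: safe to catch and handle"
    case _           => "fatal: rethrow or fail fast"
  }

  println(classify(new IllegalArgumentException("bad input"))) // non-fatal
  println(classify(new StackOverflowError("BOOM BANG!!")))     // fatal: VirtualMachineError
  println(classify(new InterruptedException("stop")))          // fatal: needs careful handling
}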
CE3 improves on this situation relative to CE2, by the way. The runtime will still exit on a fatal error, but it will propagate the exception back to the unsafeRun call site (generally within IOApp) without allocating or relying on safe points; IOApp then prints the exception, removes the application finalizers, and calls sys.exit(-1). So it still takes down the app, but at least the errors end up in the right place and will deterministically print.

Ultimately, fatal errors are a bit like segfaults: you simply cannot suppress them in general. My best advice, if you're honestly expecting SOEs from some effect in normal operation, would be to stick a manual try/catch around the root of the SOE, before it gets wrapped in IO, and suppress the exception manually.
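Following that last piece of advice, a hedged sketch of what such a manual guard might look like; the guarded helper and the RecursionDepthExceeded error type are hypothetical, not from the thread. The guard catches the StackOverflowError at its root (the one fatal error the comment above calls relatively benign) and converts it into an ordinary non-fatal error before the computation is lifted into IO.

import cats.effect.IO

object FatalGuard {
  // Hypothetical domain error standing in for the suppressed SOE.
  final case class RecursionDepthExceeded(message: String)
      extends RuntimeException(message)

  // Evaluate the thunk inside a manual try/catch so the fatal error
  // never escapes into the IO runtime.
  def guarded[A](thunk: => A): IO[A] =
    IO.suspend {
      try IO.pure(thunk)
      catch {
        case _: StackOverflowError =>
          IO.raiseError(RecursionDepthExceeded("stack depth exceeded"))
      }
    }
}

A failed IO produced this way lives in the ordinary error channel, so it can be handled with handleErrorWith, or surfaced by the server as a 500, instead of killing the process.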