Default Implicits from cats.effect.IOApp cause BlazeServer to crash when a JVM error occurs


When the default ContextShift and Timer from cats.effect.IOApp are passed to BlazeServerBuilder and a java.lang.Error is raised while processing an HTTP request, the error bubbles up to the main thread running the server and kills the app.

This behaviour doesn’t seem to happen if the user defines their own ContextShift and Timer.

Example

Suppose there’s an HTTP service that throws a StackOverflowError for some input data:

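import cats.effect.IO
import cats.syntax.either._ // provides Either.catchNonFatal

// HelloWorld.Name and HelloWorld.Greeting live in the companion object (not shown).
trait HelloWorld {
  def hello(n: HelloWorld.Name): IO[HelloWorld.Greeting]
}

def impl: HelloWorld = new HelloWorld {
  def hello(n: HelloWorld.Name): IO[HelloWorld.Greeting] =
    simulateError

  // StackOverflowError is a VirtualMachineError, which NonFatal does not match,
  // so catchNonFatal rethrows it instead of returning a Left.
  def simulateError: IO[HelloWorld.Greeting] =
    IO.fromEither {
      Either.catchNonFatal {
        throw new StackOverflowError("BOOM BANG!!")
      }
    }
}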

If the user makes a request to the endpoint bound to this implementation:

curl http://localhost:8080/hello/john

Then the Error bubbles up to the main thread and kills the app. However, if we override the default implicits in the app, the curl request just times out and the app keeps running, which is the desired behaviour.

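  import cats.effect.{ContextShift, Timer}
  import scala.concurrent.ExecutionContext

  // Inside the IOApp object: replace the default implicits with ones backed by the global pool.
  override implicit val contextShift: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
  override implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)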
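
For context on why Either.catchNonFatal in the example does not contain the error: as far as I can tell, it filters with scala.util.control.NonFatal, and on Scala 2.11 and later that extractor deliberately does not match VirtualMachineError subclasses such as StackOverflowError. The following is a minimal standard-library sketch of that filter, not the cats implementation itself. The names NonFatalCheck, attempt and escapes are made up for illustration.

```scala
import scala.util.control.NonFatal

object NonFatalCheck {
  // Runs `body`, capturing only throwables that NonFatal matches.
  // Anything NonFatal rejects propagates to the caller.
  def attempt[A](body: => A): Either[Throwable, A] =
    try Right(body)
    catch { case NonFatal(e) => Left(e) }

  // True if throwing `t` inside `attempt` escapes instead of becoming a Left.
  // Catching Throwable here is only safe because the errors are constructed
  // by hand for the demonstration, not raised by the JVM.
  def escapes(t: => Throwable): Boolean =
    try { attempt(throw t); false }
    catch { case _: Throwable => true }
}
```

With this sketch, escapes(new StackOverflowError("BOOM BANG!!")) is true while escapes(new RuntimeException("boom")) is false, which is consistent with the error leaving the IO.fromEither call rather than being turned into a failed IO.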

I have the sample application available here which simulates this error.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

5 reactions
djspiewak commented, Nov 24, 2021

Fatal errors, in general, can leave the JVM in an undefined state. StackOverflowError is probably a bad example because it is far and away the most benign, but many of the others (such as OOM) should not be caught except under very specific circumstances.

Story time: I once had a bug that only occurred in production, and only after prod had been live for a few days. What we were seeing ultimately narrowed down to floating point arithmetic producing the wrong result. It was almost as simple as 2 + 2 and seeing an answer of 5. No objects, nothing fancy, just primitives that were somehow wrong. And again, this only happened after prod had been live for a while.

After an absolutely heroic minimization effort, one of my coworkers finally noticed that, GBs further up in the logs, a StackOverflowError had been printed, seemingly benignly. It didn’t really make any sense, particularly since the stack trace didn’t actually look like a loop, but it was the first real clue. Digging into it further, we realized the SOE was actually a consequence of something much more subtle: Future at the time was catching all Throwable, not just NonFatal, and it would attempt to wrap those Throwables just like any other. The real problem was that the application was running out of memory, and thus an OutOfMemoryError was being raised within Future, which promptly caught it and tried to allocate a failed Future to carry it, but that allocation failed, which tripped a second OutOfMemoryError (still within Future), etc etc.

When all the dust settled, the OOMs had been completely eaten, and the error printed was a StackOverflowError for reasons I can’t quite remember, but by that point, the JVM was still running and the only external consequence was that an HTTP request had timed out, but internally it was in deep trouble. Digging into the JVM spec on this type of situation, what was happening was ultimately that the JVM itself (I think specifically the GC) was in a completely undefined state, with its internal data structures no longer obeying invariants. This in turn was enough to generate the seemingly-impossible and unrelated consequences later on down the line in that same instance, such as floating point arithmetic behaving impossibly.

Moral of the story: be very, very, very careful around catching Throwable. It’s fine for some “fatal” errors, like InterruptedException and StackOverflowError (though with the former, you should be sure to handle it appropriately), but in general it’s not something you should be doing.

CE3 improves on this situation from CE2 btw. The runtime will still exit on fatal error, but it will propagate the exception back to the unsafeRun call site without allocating or relying on safe points (generally within IOApp), which then prints the exception, removes the application finalizers, and calls sys.exit(-1). So it still takes down the app, but at least the errors are in the right place and will deterministically print.

Ultimately, fatal errors are a bit like segfaults: you simply cannot suppress them in general. My best advice, if you’re honestly expecting SOEs from some effect in normal operation, would be to stick a manual try/catch around the root of the SOE before it gets wrapped in IO and suppress the exception manually.
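
A rough sketch of that last suggestion, using only the standard library: catch the StackOverflowError at its source and turn it into an ordinary failure value before anything is handed to IO. The names SuppressAtRoot, depth and safeDepth are hypothetical, and the recursive function simply stands in for whatever code overflows in practice.

```scala
object SuppressAtRoot {
  // Hypothetical non-tail-recursive function: overflows the stack for large n.
  def depth(n: Int): Int = if (n <= 0) 0 else 1 + depth(n - 1)

  // Catch the StackOverflowError right where it originates and convert it
  // into a non-fatal exception carried in a Left.
  def safeDepth(n: Int): Either[Throwable, Int] =
    try Right(depth(n))
    catch {
      case e: StackOverflowError =>
        Left(new RuntimeException(s"recursion too deep for n=$n", e))
    }
}
```

The resulting Either could then be lifted with IO.fromEither(SuppressAtRoot.safeDepth(n)) inside an IO.suspend (CE2) or IO.defer (CE3), so the runtime only ever sees a regular failed IO. This is only reasonable for StackOverflowError; the same trick applied to OutOfMemoryError runs into exactly the problems described in the story above.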

