sbt 1.x Performance improvement targets


During the last weeks I spent a bit of time analyzing the performance of starting an already fully cached sbt build (i.e. taking the times with time sbt exit && time sbt exit && time sbt exit and ignoring the first result). I probably won’t have time to act on these findings myself, but I wanted to dump them here for anyone who wants to help improve performance.

  • Calculating aggregations is slow. Aggregations are cached at the beginning. For each key, the settings structure is queried for “What’s the value of aggregate in X?”. This calculation is somewhat slow because aggregation usually falls back to the default value in the global scope, and finding that value usually requires walking the whole delegation chain.

  • Validating key references. Before evaluating settings, sbt validates that the settings dependency structure is valid. This is slow because it uses slow generic abstractions like the AList and creates lots of garbage. I haven’t checked, but I wonder whether it is necessary to check dependency chains before settings evaluation at all, or whether dependency problems could instead be collected while evaluating the settings tree.

  • Parallel settings evaluation is ineffective. Settings are evaluated concurrently using the INode setup. In my tests I didn’t observe any notable parallelism, even in cases with at least two slow tasks that didn’t seem to depend on each other. I haven’t looked into it deeply, but I suspect a few issues with parallel settings evaluation:

    • Task split-up is much too fine-grained. Often there are only very small chunks of code to be evaluated. Each task submission for parallel execution has a cost, and I suspect that for small tasks this cost outweighs the potential benefit of parallelism. There often seem to be linear dependency chains that would probably execute faster sequentially on one thread.
    • Task scheduling is non-optimal. Scheduling is a very hard problem, but there might be some clever ways to schedule things if you know enough about the structure of the settings dependencies.
    • It seems like settings evaluation would be a good candidate for a fork-join pool. The current thread pool executor might introduce additional latency and probably schedules worse than what you could get by using a fork-join pool directly.

    I had a try at removing parallel execution completely, which seemed to work and, at least in my limited testing, wasn’t slower. However, it wasn’t quite correct: some builds (cinnamon) started to fail with unresolvable setting dependencies. (I thought that after a topological sort, dependencies would always be executed before their dependees, but that assumption might be wrong, e.g. for Bind nodes.) Parallel setting execution might still be worth it on some builds, though.

  • splitExpressions is slow during parsing of .sbt files because a Scala compiler has to be initialized just for parsing. I played around with caching the results of splitExpressions in https://github.com/jrudolph/xsbt/commit/b7a9f740b3426ac46ee92eec0efdb8cd2a8d16cb, which seemed to work quite well (I didn’t figure out how to use sbt’s caching infrastructure correctly, though).

  • As already noted in https://github.com/sbt/sbt/issues/3694, a main performance problem is the (de)serialization of update task caches (which are accessed to build both the meta-project and the project itself). There are several issues:

    • Tracked.lastOutput isn’t well-suited for the task at hand: the cache-reading code runs markAsCached, to flag that the data comes from the cache, only after the data was read, and afterwards all the data that was just deserialized from the cache is written back to the cache again (see https://github.com/jrudolph/xsbt/commit/3836732b202eaab79ab6636e5a3770cb7a1faf53 for a rather manual attempt to fix that).
    • sjson-new deserialization is super slow. Reading the ~5 MB caches for the akka project (and again for the meta-project) takes about 1 second on my machine. The main reason for the slowness is that data is copied around many times, because a facade is put in front of scala-json and data then has to be shuffled between representations. I attempted to improve the performance in https://github.com/eed3si9n/sjson-new/commit/c248ba050157bbf60158f279587fc598afa42d0a, but there is still lots of room for improvement. Improving that cache will likely also benefit the update task when run in the project itself. (Imo, using a binary format like protobuf for the update cache makes the most sense, as that will likely be faster than any JSON representation.)
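
The attempt above to remove parallel execution rests on the assumption that a topological sort guarantees dependencies run before their dependees. That holds when all edges are known up front, as the sketch below illustrates (a plain Kahn's-algorithm ordering, not sbt's actual INode code); Bind nodes, which discover dependencies only during evaluation, are exactly what such a static ordering cannot account for:

```scala
object TopoEval {
  // Order nodes so every node comes after all of its dependencies
  // (Kahn's algorithm). Only valid when `deps` is fully known up front;
  // dynamically discovered dependencies (like sbt's Bind nodes) can
  // invalidate the precomputed order.
  def topoOrder[A](nodes: Seq[A], deps: A => Seq[A]): Seq[A] = {
    val indegree = scala.collection.mutable.Map(nodes.map(n => n -> deps(n).size): _*)
    // Reverse edges: for each node, who depends on it.
    val dependents: Map[A, Seq[A]] =
      nodes.flatMap(n => deps(n).map(d => d -> n))
        .groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
    val queue = scala.collection.mutable.Queue(nodes.filter(n => indegree(n) == 0): _*)
    val order = scala.collection.mutable.ArrayBuffer.empty[A]
    while (queue.nonEmpty) {
      val n = queue.dequeue()
      order += n
      for (d <- dependents.getOrElse(n, Nil)) {
        indegree(d) -= 1
        if (indegree(d) == 0) queue.enqueue(d)
      }
    }
    require(order.size == nodes.size, "cycle detected")
    order.toSeq
  }
}
```

Evaluating sequentially in this order on one thread is the "linear execution" variant described above; a Bind node that adds an edge mid-evaluation breaks the `require` guarantee, which would explain the unresolvable-dependency failures.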

I attached a zip with flamegraphs captured with async-profiler: one including GC and JIT compilation threads, and one with only the Java threads.

flame-graphs.zip

Issue Analytics

  • State: open
  • Created: 6 years ago
  • Reactions: 17
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

1 reaction
lihaoyi commented, May 28, 2018

Some learnings from Mill:

splitExpressions during parsing .sbt files is slow because a Scala compiler has to be initialized for parsing. I played around with caching the results of splitExpressions in jrudolph/xsbt@b7a9f74 which seemed to work quite well (I didn’t figure out how to use sbt’s caching infrastructure correctly, though).

Mill/Ammonite use ScalaParse to split expressions in the top-level script file, but even that takes a while to initialize the first time, purely due to classloading, so it caches the output of splitExpressions. It seems like this shouldn’t be hard to do, since it’s just a pure function String => Seq[String]?
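
Since the splitter is a pure function String => Seq[String], memoizing it on a hash of the input is enough to skip compiler initialization for unchanged files. Here is a minimal sketch of that idea; the names (SplitCache, cachedSplit) are invented for illustration and are neither sbt's nor Mill's actual caching code:

```scala
import java.security.MessageDigest

object SplitCache {
  // In-memory memo table keyed by content hash. Persisting this map to disk
  // between runs is what would avoid the compiler init across sbt startups.
  private val memo = scala.collection.concurrent.TrieMap.empty[String, Seq[String]]

  private def sha1(s: String): String =
    MessageDigest.getInstance("SHA-1")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  // Run the expensive splitter only on a cache miss.
  def cachedSplit(source: String)(split: String => Seq[String]): Seq[String] =
    memo.getOrElseUpdate(sha1(source), split(source))
}
```

Hashing the file content (rather than using path + mtime) makes the cache robust against touched-but-unchanged files, at the cost of reading the file either way; for .sbt files that read is negligible next to compiler startup.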

sjson-new deserialization is super slow. Reading the ~5MB caches for the akka project (and again for the meta project) takes about 1 second on my machine

Would swapping in a fast JSON serializer be a solution here? Mill uses uPickle, which in my arbitrary benchmarks (which are more complex/branchy/intricate than typical cache JSON, which is full of long strings) does 65-70 MB/s on my MacBook by default, and 85-90 MB/s if you cache the implicitly constructed serializer objects (Mill does, when hot). This is for java.lang.String <-> case class conversion, and both read and write are similar speeds. That’s fast enough that Mill’s JSON handling basically doesn’t show up in the profiles at all, and is dwarfed by things like stat-ing files or reading the individual JSON cache files off disk.

If you don’t want to use uPickle, Circe’s performance is within a factor of 2 (~60 MB/s?), and Play-Json within a factor of 3-4 (~30 MB/s?), which may be enough to make serialization disappear from your profiles.
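
For reference, the uPickle pattern described above — caching the implicitly derived ReadWriter in a val instead of re-deriving it at every call site — looks roughly like this. The Cache/Module case classes are invented stand-ins for illustration, not the real shape of sbt's update report:

```scala
import upickle.default._

case class Module(org: String, name: String, rev: String)
case class Cache(modules: Seq[Module])

object CacheJson {
  // Deriving the ReadWriters once in vals is the "cache the implicitly
  // constructed serializer objects" optimization mentioned above.
  implicit val moduleRw: ReadWriter[Module] = macroRW
  implicit val cacheRw: ReadWriter[Cache] = macroRW

  def toJson(c: Cache): String = write(c)
  def fromJson(s: String): Cache = read[Cache](s)
}
```

Without the cached vals, each write/read call site would summon (and in some encodings rebuild) the serializer implicitly, which is where the 65-70 vs 85-90 MB/s difference quoted above comes from.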


One thing that Mill faced, and that your profiles show sbt facing too, is classloading time: according to akka-sbt-1.1.1-original.svg, your cold sbt startup is spending more than half its time in the C2 JIT compiler! Mill is also mostly classloading-bound, and most of its 2 s cold startup is classloading.

There’s not much to do here except aggressively cache things so that cached startups don’t need to load as many classes. E.g. classloading scala.tools.nsc easily takes an additional 1-2 seconds (even without running anything!), and Mill/Ammonite go to great pains to make sure that when things are cached, scala.tools.nsc is not classloaded at all (we have a unit test for this!).
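
The "unit test for this" idea can be sketched with a plain recording ClassLoader: run the cached code path under a loader that notes every class it is asked for, then assert that nothing under scala.tools.nsc appears. This is an illustration of the technique, not Mill's actual test:

```scala
// A ClassLoader that records every class name it is asked to load, so a
// test can assert that heavyweight packages were never touched.
class RecordingLoader(parent: ClassLoader) extends ClassLoader(parent) {
  val requested = scala.collection.mutable.LinkedHashSet.empty[String]
  override def loadClass(name: String, resolve: Boolean): Class[_] =
    synchronized {
      requested += name
      super.loadClass(name, resolve)
    }
}

object ClassloadCheck {
  // Run `body` and report which classes were requested through `loader`.
  def classesLoadedDuring(loader: RecordingLoader)(body: => Unit): Set[String] = {
    body
    loader.requested.toSet
  }
}
```

One caveat: classes referenced by already-loaded code resolve through their defining loader, so in a real test the cached code path has to be entered through the recording loader (e.g. loaded reflectively from it) for the recording to be complete.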

Notably, heavy dependencies like Scalaz or Cats work against you when fighting classloading time: even if you don’t run much code, just touching the library in a few places is enough to force the bulk of the classloading to take place. That’s one reason I have been very aggressive in Ammonite/Mill about culling libraries with large transitive dependency graphs in favor of 0-dependency libraries like uPickle.

0 reactions
azolotko commented, Aug 8, 2021

I’ve been reading through advancedThresholdPolicy.hpp. Here’s another interesting combination:

➜  export JAVA_OPTS="-XX:Tier3DelayOn=0 -XX:Tier3DelayOff=0 -XX:CICompilerCount=2"

➜  hyperfine --warmup 5 --min-runs 5 'sbt exit'
Benchmark #1: sbt exit
  Time (mean ± σ):      7.880 s ±  0.107 s    [User: 22.218 s, System: 1.584 s]
  Range (min … max):    7.703 s …  7.991 s    5 runs
 
➜  hyperfine --min-runs 2 'sbt "clean; compile; clean; compile; clean; compile; clean; compile; clean; compile"'
Benchmark #1: sbt "clean; compile; clean; compile; clean; compile; clean; compile; clean; compile"
  Time (mean ± σ):     218.390 s ±  4.132 s    [User: 594.795 s, System: 41.521 s]
  Range (min … max):   215.468 s … 221.312 s    2 runs