Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Include execution cache hit ratios

See original GitHub issue

Description of the feature request:

It would be ideal if we could surface cache hit ratios as first class information. As it stands now I’m only aware of two ways to try and calculate cache hit ratios from a users point of view. One is to create the execution log and parse it for remoteCacheable and remoteCacheHit then calculate the standard hit ratio. The other is to use build event logs and pluck out the runner information, for example INFO: 1436 processes: 977 disk cache hit, 456 internal, 1 darwin-sandbox, 2 worker. As discussed in https://bazelbuild.slack.com/archives/C01E7TH8XK9/p1659330964512159?thread_ts=1659329231.807649&cid=C01E7TH8XK9 there are some heuristics we could use to assume some knowledge of cache hit ratio, but it is not concrete.

The request here is to surface cache hit ratio for different dimensions

Cache hit for overall execution and some indication that no execution was done or 100% cache hit because no execution was done, in the case where rebuild yields nothing new.
Cache hit for remote cached requests for different remote caches (remote/disk)
Ratio of where cache was found, remote/disk.

What underlying problem are you trying to solve with this feature?

We are trying to understand when we hit cache regressions from a client side point of view. Recently bazel was released with broken caching in 5.2.0 https://github.com/bazelbuild/bazel/issues/15682#issuecomment-1175781129, we (twitter) realized it when our remote build cache systems experienced a lower cache hit ratio on cache requests over the network. It would have been ideal for us to also log that kind of metrics from the clients point of view as the source of truth that the runner indeed did miss and what state it might have been in at that time. It would also allow us to put monitoring metrics around the health of the runners when executing on some branches, such as master.

Easy access to this information would greatly help us identify regressions sooner and give us more confidence in bazel and rules upgrades.

Which operating system are you running Bazel on?

osx,linux

What is the output of `bazel info release`?

No response

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

No response

What’s the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

Issue Analytics

State:
Created a year ago
Comments:6 (6 by maintainers)

Top GitHub Comments

2reactions

chancilacommented, Aug 31, 2022

In what cases is the distinction remote/remote cache hit as seen above not good enough?

isn’t remote the remote exec executor? isn’t that orthogonal?

the example could be the build has 1000 total actions, 500 of which are not remote cacheable, and 500 which are…technically in this build if i get 500 remote cache hits I have 100% cache hit…this number is currently impossible to derive without the execution log. Currently we use the total number of actions / cache hits which gives us a signal we can track for regression as long as the ratio of cacheable/not cacheable doesn’t change drastically.

a significant portion of our build graph is not cacheable, this is on purpose because in some cases certain actions are faster to run locally

the size of GB

yea we have exec logs the size of 40gb+

but they compress down to a few 100 MBs with zstd and a couple of tweaks, mainly long range compression…it would be cool if bazel compressed the log as it writes it to avoid murdering IO…on macos our builds are significantly slower due to the IO hit of the exec log.

0reactions

adam-singercommented, Aug 31, 2022

@chancila @meisterT we use BEP to extract runner counts, but it’s not a very concrete set of information and you need to make assumptions on what the runner means in terms of its relation to caching. Execution logs provide concrete information about caching, the problem is on large targets the size of an execution log is huge, I’ve seen some of our executions logs in the size of GB (even when serialized into protobuf).

Our attempt at calculation of cache hit ratio is the following (its crude and I don’t think its accurate)

total = "total" (default to -1 if does not exist in runner)
cache_hits = "disk cache hit" (default to 0 if does not exist in runner) + "remote cache hit" (default to 0 if does not exist in runner)
internal = "internal" (default to 0 if does not exist in runner)
total_cachable_actions = total - internal

cache_hit_ratio = 
if total_cachable_actions < 0:
  # error we don't have runner data in BEP
 None
else if total_cachable_actions == 0:
  # assume we don't have runner metrics for cached items because they are not returned, so must be empty and internal = 1 and total = 1
  100.00
else:
  (cache_hits / total_cachable_actions) * 100

return cache_hit_ratio

Consider trying --experimental_stats_summary which will show a line like this:
INFO: 104 processes: 96 remote cache hit, 1 internal, 7 remote.
If you want to have the info programmatically, you can use the BEP, e.g. by adding the flag --build_event_json_file=/tmp/bep.json and then doing this:
$ jq '.buildMetrics.actionSummary.runnerCount | select( . != null )' < /tmp/bep.json
[
  {
    "name": "total",
    "count": 104
  },
  {
    "name": "remote cache hit",
    "count": 96
  },
  {
    "name": "internal",
    "count": 1
  },
  {
    "name": "remote",
    "count": 7
  }
]

Our attempt at calculation of cache hit ratio is the following (its crude and I don’t think its accurate)

total = "total" (default to -1 if does not exist in runner)
cache_hits = "disk cache hit" (default to 0 if does not exist in runner) + "remote cache hit" (default to 0 if does not exist in runner)
internal = "internal" (default to 0 if does not exist in runner)
total_cachable_actions = total - internal

cache_hit_ratio = 
if total_cachable_actions < 0:
  # error we don't have runner data in BEP
 None
else if total_cachable_actions == 0:
  # assume we don't have runner metrics for cached items because they are not returned, so must be empty and internal = 1 and total = 1
  100.00
else:
  (cache_hits / total_cachable_actions) * 100

return cache_hit_ratio

While this “works” its probably lossy at best if not completely wrong.