Include execution cache hit ratios
Description of the feature request:
It would be ideal if we could surface cache hit ratios as first-class information. As it stands, I'm only aware of two ways to calculate cache hit ratios from a user's point of view. One is to create the execution log, parse it for remoteCacheable and remoteCacheHit, and then calculate the standard hit ratio. The other is to use build event logs and pluck out the runner information, for example: INFO: 1436 processes: 977 disk cache hit, 456 internal, 1 darwin-sandbox, 2 worker. As discussed in https://bazelbuild.slack.com/archives/C01E7TH8XK9/p1659330964512159?thread_ts=1659329231.807649&cid=C01E7TH8XK9 there are some heuristics we could use to infer the cache hit ratio, but they are not concrete.
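For reference, the execution-log approach could be sketched roughly as below. This is a minimal sketch, not an official tool: it assumes the log was written in JSON form (for example via --execution_log_json_file) as a stream of back-to-back JSON objects, and that boolean fields that are false may simply be absent from an entry. The helper names iter_json_objects and remote_cache_hit_ratio are made up for illustration.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_json_objects(text: str) -> Iterator[dict]:
    """Yield each object from a string of back-to-back JSON objects.

    Assumption: the JSON execution log is a concatenation of objects,
    not a single JSON array.
    """
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def remote_cache_hit_ratio(entries: Iterable[dict]) -> Optional[float]:
    """Return remote cache hits / remote-cacheable spawns, or None if none were cacheable."""
    cacheable = 0
    hits = 0
    for entry in entries:
        # Missing boolean fields are treated as False (assumption).
        if entry.get("remoteCacheable", False):
            cacheable += 1
            if entry.get("remoteCacheHit", False):
                hits += 1
    if cacheable == 0:
        return None  # nothing cacheable ran, so a ratio would be meaningless
    return hits / cacheable
```

For very large logs, reading the whole file into a string as above would not be practical; a streaming reader would be needed instead.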
The request here is to surface the cache hit ratio along different dimensions:
- Cache hit ratio for the overall execution, plus some indication that no execution was done (or that the 100% cache hit is due to no execution being done) when a rebuild yields nothing new.
- Cache hit ratio for remote-cacheable requests, per cache (remote/disk).
- Ratio of where each cache hit was found (remote vs. disk).
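To make these dimensions concrete, here is a rough sketch of the heuristic route: parsing the process summary line from the build output and deriving an overall ratio plus a per-cache breakdown. The function names are hypothetical, and two things are assumptions rather than documented behavior: that labels ending in "cache hit" are the cache hits, and that "internal" processes should be excluded from the denominator.

```python
import re
from typing import Optional


def parse_process_summary(line: str) -> dict:
    """Parse e.g. 'INFO: 1436 processes: 977 disk cache hit, 456 internal, ...'."""
    match = re.search(r"(\d+) process(?:es)?: (.*)", line)
    if match is None:
        raise ValueError("not a process summary line: %r" % line)
    by_runner = {}
    for part in match.group(2).rstrip(".").split(","):
        part = part.strip()
        if not part:
            continue
        # Each part looks like '<count> <label>', where the label may contain spaces.
        count, _, label = part.partition(" ")
        by_runner[label] = int(count)
    return {"total": int(match.group(1)), "by_runner": by_runner}


def cache_hit_breakdown(summary: dict) -> dict:
    """Heuristic hit ratio and per-cache hit counts from a parsed summary."""
    by_runner = summary["by_runner"]
    # Assumption: any label ending in 'cache hit' counts as a cache hit.
    by_cache = {k: v for k, v in by_runner.items() if k.endswith("cache hit")}
    hits = sum(by_cache.values())
    # Assumption: 'internal' processes are never cache candidates.
    candidates = summary["total"] - by_runner.get("internal", 0)
    ratio: Optional[float] = hits / candidates if candidates > 0 else None
    return {"hit_ratio": ratio, "by_cache": by_cache}
```

A hit_ratio of None here stands for the "no execution was done" case, so it is not reported as a 100% hit.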
What underlying problem are you trying to solve with this feature?
We are trying to understand when we hit cache regressions from the client's point of view. Recently, Bazel 5.2.0 was released with broken caching (https://github.com/bazelbuild/bazel/issues/15682#issuecomment-1175781129). We (Twitter) only realized it when our remote build cache systems saw a lower hit ratio on cache requests over the network. It would have been ideal for us to also log these metrics from the client's point of view, as the source of truth that the runner did indeed miss and what state it might have been in at the time. It would also allow us to put monitoring around the health of the runners when executing on certain branches, such as master.
Easy access to this information would greatly help us identify regressions sooner and give us more confidence in bazel and rules upgrades.
Which operating system are you running Bazel on?
osx,linux
What is the output of bazel info release?
No response
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What’s the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response
Issue Analytics
- State:
- Created a year ago
- Comments:6 (6 by maintainers)
Top GitHub Comments
Isn’t remote the remote exec executor? Isn’t that orthogonal? The example could be: the build has 1000 total actions, 500 of which are not remote cacheable and 500 of which are. Technically, if I get 500 remote cache hits in this build, I have a 100% cache hit ratio. This number is currently impossible to derive without the execution log. Currently we use the total number of actions / cache hits, which gives us a signal we can track for regressions as long as the ratio of cacheable to non-cacheable actions doesn’t change drastically.
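The arithmetic in that example can be spelled out in a few lines. This is only an illustration: hit_ratios is a hypothetical helper, and both ratios are written as hits divided by a denominator.

```python
def hit_ratios(total_actions: int, remote_cacheable: int, remote_cache_hits: int) -> dict:
    """Compare the 'true' ratio (needs the execution log) with the proxy over all actions."""
    return {
        # Hits over cacheable actions: requires knowing which actions were cacheable.
        "hits_over_cacheable": (
            remote_cache_hits / remote_cacheable if remote_cacheable else None
        ),
        # Hits over all actions: derivable from runner counts alone.
        "hits_over_total": (
            remote_cache_hits / total_actions if total_actions else None
        ),
    }


# The build from the comment: 1000 actions, 500 cacheable, 500 hits.
baseline = hit_ratios(1000, 500, 500)

# Hypothetical later build: more actions became cacheable, hits unchanged.
shifted = hit_ratios(1000, 800, 500)
```

In the baseline, hits over cacheable is 1.0 while hits over total is 0.5. In the shifted build, hits over cacheable drops to 0.625 but hits over total stays at 0.5, which is why the proxy only works while the cacheable share stays roughly stable.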
A significant portion of our build graph is not cacheable. This is on purpose, because in some cases certain actions are faster to run locally.
Yeah, we have exec logs 40 GB+ in size, but they compress down to a few hundred MB with zstd and a couple of tweaks, mainly long-range compression. It would be cool if Bazel compressed the log as it writes it, to avoid murdering IO. On macOS our builds are significantly slower due to the IO hit of the exec log.
@chancila @meisterT We use BEP to extract runner counts, but it’s not a very concrete set of information, and you need to make assumptions about what each runner means in relation to caching. Execution logs provide concrete information about caching; the problem is that for large targets the execution log is huge. I’ve seen some of our execution logs reach gigabytes in size (even when serialized as protobuf).
Our attempt at calculating the cache hit ratio is the following (it’s crude, and I don’t think it’s accurate):
While this “works”, it’s probably lossy at best, if not completely wrong.