Unexpected Exception 'Must not call uploadBlobs after shutdown.' when closing BEP transports, this is a bug.
See original GitHub issueDescription of the problem:
With a gRPC remote cache and BES backend enabled, Bazel intermittently fails to write build event file with the following error:
java.lang.RuntimeException: Unexpected Exception 'Must not call uploadBlobs after shutdown.' when closing BEP transports, this is a bug.
at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:503)
at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:581)
at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:599)
at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:626)
at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:604)
at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:231)
at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:543)
at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:606)
at io.grpc.Context$1.run(Context.java:579)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Must not call uploadBlobs after shutdown.
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:545)
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:102)
at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:237)
at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:487)
... 11 more
I’ve traced this down FileTransport.java and BuildEventServiceUploader.java both calling shutdown
on the same ByteStreamBuildEventArtifactUploader. Normally this is fine, because they shutdown around the same time, after all uploads are complete.
However, if findMissingDigests takes a while, FileTransport’s shutdown is called which shuts down the ByteStreamBuildEventArtifactUploader. Then when findMissingDigests returns - upload is attempted on an uploader that has already been shut down.
This happens in practice on builds with thousands of outputs that need to be uploaded. It can also be triggered artificially by adding an intermittent sleep into Bazel in GrpcCacheClient here before returning the missing digests.
if (Math.random() < .1) {
try { Thread.sleep((long)(Math.random() * 3000)); } catch(InterruptedException e) {}
}
Bugs: what’s the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
Create the following BUILD file that generates 10 dummy outputs to be downloaded.
$ cat BUILD
[genrule(
name = "target-{}".format(i),
outs = ["output-{}.txt".format(i)],
cmd = "echo {} > $@".format(i),
visibility = ["//visibility:public"],
) for i in range(10)]
Add intermittent sleep to Bazel in GrpcCacheClient here before returning the missing digests.
if (Math.random() < .1) {
try { Thread.sleep((long)(Math.random() * 3000)); } catch(InterruptedException e) {}
}
Run a build with --remote_cache
, --bes_backend
, --build_event_json_file
set. Setting a random --remote_instance_name
makes sure that the outputs will be freshly uploaded on each run. This fails >50% of the time.
bazel build //... --remote_cache=cloud.buildbuddy.dev --bes_backend=cloud.buildbuddy.dev --build_event_json_file=/tmp/bep.json --remote_instance_name=$(date +%s)
What operating system are you running Bazel on?
Linux
What’s the output of bazel info release
?
release 3.7.1
I’ve reproduced on 3.1.0 and 3.7.1, but haven’t tried outside of that range. It’s easiest to reproduce with a custom Bazel version with the sleep.
Any other information, logs, or outputs that you want to share?
Some less than ideal fixes I’ve found for this include removing the shutdown call from FileTransport, or wrapping uploadLocalFiles in BuildEventArtifactUploader with uploader.retain() and uploader.release(). Don’t love either of these.
Would love any advice from someone who is more familiar with this code on what a good fix would look like - would be happy to send a pull request. I’ve seen this error with 4 different companies we’ve been working with.
Not sure if this this is related, but I’m hoping that the fix here might also fix another bug which we see much more frequently and doesn’t seem to require setting build_event_json_file
, but I’ve been having a harder time reproducing reliably enough to file a detailed bug report.
FAILED: Build did NOT complete successfully
WARNING: BES was not properly closed
Internal error thrown during build.
Printing stack trace: java.util.concurrent.RejectedExecutionException:
Task com.google.common.util.concurrent.TrustedListenableFutureTask@12febc19
[status=PENDING, info=[task=[running=[NOT STARTED YET],
com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda$1058/0x00000008008b5c40@633e8b27]]]
rejected from java.util.concurrent.ThreadPoolExecutor@b0ef88a
[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 172]
My hunch for that one is that after commit https://github.com/bazelbuild/bazel/commit/d82341d508829c43ab6408910f0b03cacf5b03b2 that ThreadPool is shut down and some race condition is causing uploads to be added to the ThreadPool after it has been shutdown.
Really appreciate any and all help - would be happy to send pull requests for fixes. Just looking for some guidance on what the right fix looks like.
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (6 by maintainers)
The fix is currently in review; the patch may be accessed early at https://bazel-review.googlesource.com/c/bazel/+/149490.
I’ve successfully reproduced this issue using the steps provided. I’m looking into a fix now. Thanks for reporting this!