Unexpected Exception 'Must not call uploadBlobs after shutdown.' when closing BEP transports, this is a bug.

Description of the problem:

With a gRPC remote cache and BES backend enabled, Bazel intermittently fails to write the build event file with the following error:

java.lang.RuntimeException: Unexpected Exception 'Must not call uploadBlobs after shutdown.' when closing BEP transports, this is a bug.
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:503)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:581)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:599)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:626)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:604)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:231)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:543)
	at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$1(GrpcServerImpl.java:606)
	at io.grpc.Context$1.run(Context.java:579)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Must not call uploadBlobs after shutdown.
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:564)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:545)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:102)
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:237)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:487)
	... 11 more

I’ve traced this down to FileTransport.java and BuildEventServiceUploader.java both calling shutdown on the same ByteStreamBuildEventArtifactUploader. Normally this is fine, because they shut down around the same time, after all uploads are complete.

However, if findMissingDigests takes a while, FileTransport’s shutdown is called first, which shuts down the ByteStreamBuildEventArtifactUploader. Then, when findMissingDigests returns, an upload is attempted on an uploader that has already been shut down.
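
The ordering described above can be replayed deterministically outside Bazel. The following is only a minimal sketch: `ShutdownRaceSketch` and `SketchUploader` are hypothetical stand-ins, not Bazel's actual classes, and the "slow lookup" is modeled with a `CompletableFuture` that is completed by hand after the shutdown.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in types; this is not Bazel's actual code.
final class ShutdownRaceSketch {
  static final class SketchUploader {
    private final AtomicBoolean shutdown = new AtomicBoolean(false);

    String uploadBlobs(String digest) {
      if (shutdown.get()) {
        throw new IllegalStateException("Must not call uploadBlobs after shutdown.");
      }
      return "uploaded " + digest;
    }

    void shutdown() {
      shutdown.set(true);
    }
  }

  /** Returns the failure message if the upload ran after shutdown, else null. */
  static String runRace() {
    SketchUploader uploader = new SketchUploader();
    // Owner A: a slow missing-digest lookup, followed by the upload.
    CompletableFuture<Void> slowLookup = new CompletableFuture<>();
    CompletableFuture<String> upload =
        slowLookup.thenApply(ignored -> uploader.uploadBlobs("abc123"));
    // Owner B finishes first and shuts down the shared uploader.
    uploader.shutdown();
    // Only now does the lookup return, so the upload runs too late.
    slowLookup.complete(null);
    try {
      upload.join();
      return null;
    } catch (CompletionException e) {
      return e.getCause().getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(runRace());
  }
}
```

Because the lookup is completed only after `shutdown()`, the dependent upload always hits the shutdown check, which mirrors the failure without relying on timing.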

This happens in practice on builds with thousands of outputs that need to be uploaded. It can also be triggered artificially by adding an intermittent sleep to GrpcCacheClient in Bazel, just before it returns the missing digests.

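// Debug only: roughly 10% of calls sleep for up to 3 seconds before returning.
if (Math.random() < .1) {
    try { Thread.sleep((long) (Math.random() * 3000)); }
    catch (InterruptedException e) { Thread.currentThread().interrupt(); }
}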

Bugs: what’s the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Create the following BUILD file, which generates 10 dummy outputs to be uploaded.

$ cat BUILD
[genrule(
    name = "target-{}".format(i),
    outs = ["output-{}.txt".format(i)],
    cmd = "echo {} > $@".format(i),
    visibility = ["//visibility:public"],
) for i in range(10)]

Add an intermittent sleep to GrpcCacheClient in Bazel, just before it returns the missing digests.

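// Debug only: roughly 10% of calls sleep for up to 3 seconds before returning.
if (Math.random() < .1) {
    try { Thread.sleep((long) (Math.random() * 3000)); }
    catch (InterruptedException e) { Thread.currentThread().interrupt(); }
}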

Run a build with --remote_cache, --bes_backend, and --build_event_json_file set. Setting a random --remote_instance_name makes sure that the outputs will be freshly uploaded on each run. The build fails more than 50% of the time.

bazel build //... --remote_cache=cloud.buildbuddy.dev --bes_backend=cloud.buildbuddy.dev --build_event_json_file=/tmp/bep.json --remote_instance_name=$(date +%s)

What operating system are you running Bazel on?

Linux

What’s the output of bazel info release?

release 3.7.1

I’ve reproduced on 3.1.0 and 3.7.1, but haven’t tried outside of that range. It’s easiest to reproduce with a custom Bazel version with the sleep.

Any other information, logs, or outputs that you want to share?

Some less than ideal fixes I’ve found for this include removing the shutdown call from FileTransport, or wrapping uploadLocalFiles in BuildEventArtifactUploader with uploader.retain() and uploader.release(). Don’t love either of these.
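
For readers unfamiliar with the retain/release idea, here is a rough sketch of how reference counting keeps a shared resource open while an upload is in flight. `RefCountedUploaderSketch` and all of its methods are hypothetical names chosen for illustration; this is not the code in Bazel and not the fix that was eventually submitted.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference-counted uploader; a sketch, not Bazel's actual class.
final class RefCountedUploaderSketch {
  private final AtomicInteger refs = new AtomicInteger(1); // the creator's reference
  private volatile boolean closed = false;

  /** Takes an extra reference; fails if the uploader is already closed. */
  void retain() {
    int current;
    do {
      current = refs.get();
      if (current == 0) {
        throw new IllegalStateException("uploader already closed");
      }
    } while (!refs.compareAndSet(current, current + 1));
  }

  /** Drops one reference; the last release closes the uploader. */
  void release() {
    int remaining = refs.decrementAndGet();
    if (remaining < 0) {
      throw new IllegalStateException("released too many times");
    }
    if (remaining == 0) {
      closed = true;
    }
  }

  boolean isClosed() {
    return closed;
  }

  String upload(String file) {
    if (closed) {
      throw new IllegalStateException("Must not call uploadBlobs after shutdown.");
    }
    return "uploaded " + file;
  }

  /** Replays the early-shutdown ordering, with the upload holding a reference. */
  static String demo() {
    RefCountedUploaderSketch uploader = new RefCountedUploaderSketch();
    uploader.retain();                          // the in-flight upload pins the uploader
    uploader.release();                         // the other owner shuts down early
    boolean closedEarly = uploader.isClosed();  // still open: one reference remains
    String result = uploader.upload("output-0.txt");
    uploader.release();                         // upload finished; last reference closes it
    return closedEarly + "," + result + "," + uploader.isClosed();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The trade-off is that every code path that starts an upload has to pair its retain with a release, which is part of why this approach can feel heavy-handed.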

Would love any advice from someone who is more familiar with this code on what a good fix would look like - would be happy to send a pull request. I’ve seen this error with 4 different companies we’ve been working with.

Not sure if this is related, but I’m hoping that the fix here might also fix another bug, which we see much more frequently and which doesn’t seem to require setting build_event_json_file. I’ve been having a harder time reproducing that one reliably enough to file a detailed bug report.

FAILED: Build did NOT complete successfully
WARNING: BES was not properly closed
Internal error thrown during build. 
Printing stack trace: java.util.concurrent.RejectedExecutionException: 
Task com.google.common.util.concurrent.TrustedListenableFutureTask@12febc19
[status=PENDING, info=[task=[running=[NOT STARTED YET], 
com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader$$Lambda$1058/0x00000008008b5c40@633e8b27]]] 
rejected from java.util.concurrent.ThreadPoolExecutor@b0ef88a
[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 172]

My hunch for that one is that, after commit https://github.com/bazelbuild/bazel/commit/d82341d508829c43ab6408910f0b03cacf5b03b2, the ThreadPool is shut down and some race condition is causing uploads to be added to it after it has been shut down.
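
The rejection in the trace above is the standard behavior of a `java.util.concurrent` thread pool that receives work after `shutdown()`. A minimal sketch of just that JDK behavior follows; the class name `RejectedAfterShutdownSketch` is made up, and this does not attempt to model how Bazel reaches that state.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo class; shows only the JDK behavior, not Bazel's code path.
final class RejectedAfterShutdownSketch {
  /** Returns true if submitting to an already shut down pool is rejected. */
  static boolean submitAfterShutdownIsRejected() {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    pool.shutdown(); // no new tasks are accepted from here on
    try {
      pool.submit(() -> "late upload");
      return false;
    } catch (RejectedExecutionException e) {
      // The default rejection policy of ThreadPoolExecutor throws.
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(submitAfterShutdownIsRejected());
  }
}
```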

Really appreciate any and all help - would be happy to send pull requests for fixes. Just looking for some guidance on what the right fix looks like.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

2 reactions
michaeledgar commented, Dec 29, 2020

The fix is currently in review; the patch may be accessed early at https://bazel-review.googlesource.com/c/bazel/+/149490.

1 reaction
michaeledgar commented, Dec 4, 2020

I’ve successfully reproduced this issue using the steps provided. I’m looking into a fix now. Thanks for reporting this!
