question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Internal error thrown during build. Printing stack trace: java.lang.RuntimeException: Unexpected Exception 'null' when closing BEP transports, this is a bug.

See original GitHub issue

Description of the problem / feature request:

I was updating from bazel 3.4.1 to 3.5.0, and after flipping to use 3.5.0 our builds started to fail with the following exception:

[2020-09-29T12:54:59.710Z] Executed 0 out of 1550 tests: 590 tests pass and 960 were skipped.
[2020-09-29T12:54:59.720Z] There were tests whose specified size is too big. Use the --test_verbose_timeout_warnings command line option to see which ones these are.
[2020-09-29T12:54:59.725Z] INFO: Build completed successfully, 15248 total actions
[2020-09-29T12:54:59.737Z] Internal error thrown during build. Printing stack trace: java.lang.RuntimeException: Unexpected Exception 'null' when closing BEP transports, this is a bug.
[2020-09-29T12:54:59.748Z] 	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:475)
[2020-09-29T12:54:59.757Z] 	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:550)
[2020-09-29T12:54:59.767Z] 	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:565)
[2020-09-29T12:54:59.774Z] 	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:628)
[2020-09-29T12:54:59.784Z] 	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:617)
[2020-09-29T12:54:59.792Z] 	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:236)
[2020-09-29T12:54:59.799Z] 	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:661)
[2020-09-29T12:54:59.807Z] 	at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$2(GrpcServerImpl.java:726)
[2020-09-29T12:54:59.810Z] 	at io.grpc.Context$1.run(Context.java:605)
[2020-09-29T12:54:59.816Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
[2020-09-29T12:54:59.823Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
[2020-09-29T12:54:59.827Z] 	at java.base/java.lang.Thread.run(Unknown Source)
[2020-09-29T12:54:59.836Z] java.lang.RuntimeException: Unexpected Exception 'null' when closing BEP transports, this is a bug.
[2020-09-29T12:54:59.850Z] 	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:475)
[2020-09-29T12:54:59.863Z] 	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:550)
[2020-09-29T12:54:59.875Z] 	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:565)
[2020-09-29T12:54:59.884Z] 	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:628)
[2020-09-29T12:54:59.895Z] 	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:617)
[2020-09-29T12:54:59.904Z] 	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:236)
[2020-09-29T12:54:59.914Z] 	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:661)
[2020-09-29T12:54:59.922Z] 	at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$2(GrpcServerImpl.java:726)
[2020-09-29T12:54:59.927Z] 	at io.grpc.Context$1.run(Context.java:605)
[2020-09-29T12:54:59.935Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
[2020-09-29T12:54:59.943Z] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
[2020-09-29T12:54:59.948Z] 	at java.base/java.lang.Thread.run(Unknown Source)

Unfortunately I do not have an exact reproduction case, as this appears to be an underlying timing issue, and occurs when we have a large set of build/test targets run using a BRE environment.

The error only occurs when we are writing out the build events as json, and there is a large set of targets, in my case 1550 tests, which leads to 15248 build actions. In my case I can reproduce the issue in our pipelines and locally fairly consistently with our target set of //…

bazel test --build_event_json_file=/tmp/build_events.json --config=remote-fast //...

In the above --config=remote-fast enables remote caching and remote execution to a buildbarn BRE setup.

I started to bisect to find which change between bazel 3.4.1 and 3.5.0, triggered the issue.

* BROKEN * d30b1997420b17af7864b46f2065269b42d692ba Caclulate configurable attribute values with iteration instead of recursion
* WORKS * 30bfcdf98470c85bb00d98487becb00d82cb82d6 bazel syntax: break lib.syntax -> skyframe.serialization dependency
* WORKS * 87e3017b8761c6769fbaa1f1370fed6df82fc9cc Support signing key rotation in android_binary.
932133f890608e15a5bc45b976c28b6b7df93274 bazel syntax: break lib.syntax tests -> skyframe.serialization.testutils dependency
* WORKS * 993ef784083da11f8e5e4958ca2c96d61743c4c6 The great Skylark -> Starlark class migration:
8874477c9ec5700efe9b924a3be76099ebd6ede7 Restore shared_library attribute in Starlark's cc_import
* WORKS * 003f7664c91f4fec81e7607c0b59651915836d3a Failed target/aspect completions include resolved artifact paths in BEP.

However after reverting just d30b1997420b17af7864b46f2065269b42d692ba the issue still occurs, this lead me to believe that this is more a timing issue, and it was a combination of changes, that cause an underlying bug to be tickled. Some of the changes leading up to the exception, are large and maybe cause timings that expose an existing race condition?

I modified the suspect area to remove the e.getCause().getMessage that is printing the ‘null’ in the exception catch log.

-              e.getCause().getMessage()));
+              e.getMessage()));

This shows the true exception of ‘java.lang.UnsupportedOperationException’ Unexpected Exception 'java.lang.UnsupportedOperationException' when closing BEP

Following the trail from this call to block while waiting for for the BEP to closeBepTransports try (AutoProfiler p = GoogleAutoProfilerUtils.logged(“waiting for BES close”)) { Uninterruptibles.getUninterruptibly(Futures.allAsList(transportFutures.values())); } I changed things up a bit more to simply catch the exception then print the stack, subverting the RuntimeException handling. This leads to the following exception.

java.lang.UnsupportedOperationException
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:526)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:487)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
	at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:196)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.waitForBuildEventTransportsToClose(BuildEventServiceModule.java:463)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.closeBepTransports(BuildEventServiceModule.java:541)
	at com.google.devtools.build.lib.buildeventservice.BuildEventServiceModule.afterCommand(BuildEventServiceModule.java:556)
	at com.google.devtools.build.lib.runtime.BlazeRuntime.afterCommand(BlazeRuntime.java:628)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.execExclusively(BlazeCommandDispatcher.java:617)
	at com.google.devtools.build.lib.runtime.BlazeCommandDispatcher.exec(BlazeCommandDispatcher.java:236)
	at com.google.devtools.build.lib.server.GrpcServerImpl.executeCommand(GrpcServerImpl.java:661)
	at com.google.devtools.build.lib.server.GrpcServerImpl.lambda$run$2(GrpcServerImpl.java:726)
	at io.grpc.Context$1.run(Context.java:605)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.UnsupportedOperationException
	at com.google.devtools.build.lib.remote.RemoteActionFileSystem$1.getNodeId(RemoteActionFileSystem.java:317)
	at com.google.devtools.build.lib.actions.cache.DigestUtils$CacheKey.<init>(DigestUtils.java:104)
	at com.google.devtools.build.lib.actions.cache.DigestUtils.manuallyComputeDigest(DigestUtils.java:249)
	at com.google.devtools.build.lib.actions.cache.DigestUtils.getDigestWithManualFallback(DigestUtils.java:230)
	at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:64)
	at com.google.devtools.build.lib.remote.util.DigestUtil.compute(DigestUtil.java:60)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.readPathMetadata(ByteStreamBuildEventArtifactUploader.java:129)
	at com.google.devtools.build.lib.remote.ByteStreamBuildEventArtifactUploader.lambda$upload$3(ByteStreamBuildEventArtifactUploader.java:222)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)

So it looks like there is a operation of getNodeID called but that is not supported. at com.google.devtools.build.lib.remote.RemoteActionFileSystem$1.getNodeId(RemoteActionFileSystem.java:317)

And that looks like this might be the root cause of the issue the statFromRemoteMetadata throws a ‘UnsupportedOperationException’ https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/RemoteActionFileSystem.java;drc=03be73edcf5d41bbef31220f9ebf86f1ee629103;l=317

Unfortunately I do not know enough of the internals of Bazel to fix the root issue. Hoping with this information someone familiar with this code, will be able to isolate further.

What operating system are you running Bazel on?

MacOS, Linux

What’s the output of bazel info release?

release 3.5.0

Have you found anything relevant by searching the web?

No issues found that match the signature of the exception.

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Comments:43 (23 by maintainers)

github_iconTop GitHub Comments

1reaction
janakdrcommented, Oct 9, 2020

Yeah, unfortunately the name of the file is not in the RemoteFileArtifactValue, which I knew, but was avoiding confronting 😃 . Will mail you a PR not for submission that adds more debugging.

1reaction
janakdrcommented, Oct 8, 2020

There have been some changes in the area of computing digests. @justinhorvitz made one, https://github.com/bazelbuild/bazel/commit/2983d0454ace4fcbe84c118eff0d7be6c7db7339. I audited it and didn’t see anything obvious that would cause this failure. It looks like this codepath is unexpected: RemoteActionFileSystem is supposed to have fast digests, so we should never get to the point where we’re looking for the node id. I can add some logging that will show the name of the file that’s causing this issue, which might help.

Read more comments on GitHub >

github_iconTop Results From Across the Web

JDK-8212226 SurfaceManager throws "Invalid Image variant ...
getManager to throw an "Invalid Image variant" exception when the containing window is dragged from one screen to another. See the stack trace...
Read more >
Command-Line Reference | Bazel
Executes a dependency graph query. run, Runs the specified target. shutdown, Stops the bazel server. sync, Syncs all ...
Read more >
Oracle Universal Connection Pool for JDBC Developer's Guide
This guide is primarily written for Application Developers and System Architects who want to learn how to use UCP for JDBC to create...
Read more >
jdk/jdk: 17148c9457a6 - OpenJDK Mercurial Repositories
Oracle designates this -# particular file as subject to the "Classpath" exception as provided -# by Oracle in the LICENSE file that accompanied...
Read more >
INTRODUCTION TO JAVA PROGRAMMING
jstack – utility which prints Java stack traces of Java threads ... to beported to get Java to run on a new computer...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found