Bazel occasionally gets stuck during execution when using multiplexed workers

Description of the problem:

Bazel occasionally gets stuck during execution when using multiplexed workers

This issue is mentioned in the Multiplexed Workers doc. I’m filing this bug to open it up to the community since we have not been successful in tracking it down so far. If anyone trying multiplexed workers runs into it, it would be valuable to hear back from you.

What’s the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

We have not been able to reproduce this bug consistently. It seems to be due to a race condition.

What operating system are you running Bazel on?

Ubuntu 18.04

What’s the output of bazel info release?

We’re using a custom release (off of Bazel 0.29); however, the symptoms have also been observed by others running different versions of Bazel.

Have you found anything relevant by searching the web?

Similar hanging behavior was observed in one of the multiplexed worker tests.

Any other information, logs, or outputs that you want to share?

The following snippets from thread dumps of the Bazel server and of the worker process show that each is blocked reading from its input stream, waiting for a message from the other.

Bazel server:

"Thread-2": running, holding [0x000000072def8500]
	at java.io.FileInputStream.readBytes(java.base@11.0.2/Native Method)
	at java.io.FileInputStream.read(java.base@11.0.2/Unknown Source)
	at java.io.BufferedInputStream.fill(java.base@11.0.2/Unknown Source)
	at java.io.BufferedInputStream.read(java.base@11.0.2/Unknown Source)
	at java.io.FilterInputStream.read(java.base@11.0.2/Unknown Source)
	at com.google.devtools.build.lib.worker.RecordingInputStream.read(RecordingInputStream.java:56)
	at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:253)
	at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:275)
	at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:280)
	at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
	at com.google.protobuf.GeneratedMessageV3.parseDelimitedWithIOException(GeneratedMessageV3.java:347)
	at com.google.devtools.build.lib.worker.WorkerProtocol$WorkResponse.parseDelimitedFrom(WorkerProtocol.java:2279)
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.waitResponse(WorkerMultiplexer.java:185)
	at com.google.devtools.build.lib.worker.WorkerMultiplexer.run(WorkerMultiplexer.java:210)

Worker process:

"main": running, holding [0x0000000704878b68]
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:255)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:246)
	at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:267)
	at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:272)
	at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:48)
	at com.google.protobuf.GeneratedMessageV3.parseDelimitedWithIOException(GeneratedMessageV3.java:368)
	at com.google.devtools.build.lib.worker.WorkerProtocol$WorkRequest.parseDelimitedFrom(WorkerProtocol.java:1164)
	at higherkindness.rules_scala.common.worker.WorkerMain.process$1(WorkerMain.scala:53)
	at higherkindness.rules_scala.common.worker.WorkerMain.main(WorkerMain.scala:101)
	at higherkindness.rules_scala.common.worker.WorkerMain.main$(WorkerMain.scala:27)
	at higherkindness.rules_scala.workers.zinc.compile.ZincRunner$.main(ZincRunner.scala:55)
	at higherkindness.rules_scala.workers.zinc.compile.ZincRunner.main(ZincRunner.scala)

Important files:

  • The multiplexer is implemented in bazelbuild/bazel: WorkerMultiplexer.java
  • The worker used by the symptomatic multiplexed worker test mentioned earlier is also implemented in bazelbuild/bazel: ExampleWorkerMultiplexer.java (note: this is a multiplexer-compatible worker, not a multiplexer)
  • The worker we use (and which yielded the worker process thread dump snippet above) is implemented in a multiplexer-compatible branch of higherkindness/rules_scala: ZincRunner.scala, WorkerMain.scala

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
SrodriguezO commented, Oct 9, 2020

We figured out what was causing this issue for us. It was a problem with our multiplex worker implementation. The issue occurred whenever the worker encountered a Fatal exception, in which case the worker would neither report the failure nor exit. More details here

Since Scala's Future only catches NonFatal exceptions, any fatal exception would cause the Future to never complete. As a result, a WorkResponse was never sent back to the Bazel server, which would then wait forever for a response.

The change I linked seems to have resolved the issue entirely for us. It seems @tomdegoede encountered a different deadlock on Bazel’s end though, so I’ll leave this issue open for now.
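
The failure mode described in this comment can be illustrated with a minimal sketch. It is not the code from the linked change: `FatalInFutureSketch`, `doWork`, `unguardedTimesOut`, and `guardedResult` are hypothetical names, and `LinkageError` is only an assumed stand-in for whatever fatal error the worker actually hit. The first helper shows a `Future` that is never completed because its body throws a fatal error; the second shows one possible guard, catching `Throwable` inside the body so that some result is always available to report.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Try}

object FatalInFutureSketch {
  // Hypothetical stand-in for worker logic that hits a fatal error.
  // LinkageError is one of the throwables that NonFatal does not match.
  def doWork(): Int = throw new LinkageError("simulated fatal error")

  // The Future body throws a fatal error, so the Future is never completed
  // and waiting on it can only end by timing out.
  def unguardedTimesOut(): Boolean = {
    val unguarded = Future(doWork())
    Try(Await.result(unguarded, 2.seconds)) match {
      case Failure(_: TimeoutException) => true
      case _                            => false
    }
  }

  // Catching Throwable inside the body converts the fatal error into an
  // ordinary value, so the Future completes and a failure can be reported.
  def guardedResult(): Either[String, Int] = {
    val guarded: Future[Either[String, Int]] = Future {
      try Right(doWork())
      catch { case t: Throwable => Left(t.getMessage) }
    }
    Await.result(guarded, 2.seconds)
  }

  def main(args: Array[String]): Unit = {
    println(s"unguarded future timed out: ${unguardedTimesOut()}")
    println(s"guarded future result: ${guardedResult()}")
  }
}
```

In a real worker, the guarded branch would be the place to build a failing WorkResponse or to exit the process, so that the Bazel server is not left blocked on a read. Whether continuing after a fatal error is safe depends on the error, so exiting may be the more conservative choice.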

1 reaction
susinmotion commented, Dec 6, 2019

Ah, I see. We aren’t using this feature yet, so we haven’t been encountering this problem, but our tests do show it. I’ll keep this issue open for our team (or anyone else who has ideas!), but it may be some time before we have a chance to investigate.
