Bazel occasionally gets stuck during execution when using multiplexed workers
Description of the problem:
Bazel occasionally gets stuck during execution when using multiplexed workers.
This issue is mentioned in the Multiplexed Workers doc. I’m filing this bug to open it up to the community, since we have not been successful in tracking it down so far. If you try multiplexed workers and run into it, it would be valuable to hear from you.
What’s the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
We have not been able to reproduce this bug consistently. It seems to be due to a race condition.
What operating system are you running Bazel on?
Ubuntu 18.04
What’s the output of bazel info release?
We’re using a custom release (based on Bazel 0.29); however, others running different versions of Bazel have observed the same symptoms.
Have you found anything relevant by searching the web?
Similar hanging behavior was observed in one of the multiplexed worker tests.
Any other information, logs, or outputs that you want to share?
The following snippets from thread dumps of the Bazel server and the worker process show that each side is blocked reading messages from the other.
Bazel server:
"Thread-2": running, holding [0x000000072def8500]
at java.io.FileInputStream.readBytes(java.base@11.0.2/Native Method)
at java.io.FileInputStream.read(java.base@11.0.2/Unknown Source)
at java.io.BufferedInputStream.fill(java.base@11.0.2/Unknown Source)
at java.io.BufferedInputStream.read(java.base@11.0.2/Unknown Source)
at java.io.FilterInputStream.read(java.base@11.0.2/Unknown Source)
at com.google.devtools.build.lib.worker.RecordingInputStream.read(RecordingInputStream.java:56)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:275)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:280)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at com.google.protobuf.GeneratedMessageV3.parseDelimitedWithIOException(GeneratedMessageV3.java:347)
at com.google.devtools.build.lib.worker.WorkerProtocol$WorkResponse.parseDelimitedFrom(WorkerProtocol.java:2279)
at com.google.devtools.build.lib.worker.WorkerMultiplexer.waitResponse(WorkerMultiplexer.java:185)
at com.google.devtools.build.lib.worker.WorkerMultiplexer.run(WorkerMultiplexer.java:210)
Worker process:
"main": running, holding [0x0000000704878b68]
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:246)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:267)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:272)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:48)
at com.google.protobuf.GeneratedMessageV3.parseDelimitedWithIOException(GeneratedMessageV3.java:368)
at com.google.devtools.build.lib.worker.WorkerProtocol$WorkRequest.parseDelimitedFrom(WorkerProtocol.java:1164)
at higherkindness.rules_scala.common.worker.WorkerMain.process$1(WorkerMain.scala:53)
at higherkindness.rules_scala.common.worker.WorkerMain.main(WorkerMain.scala:101)
at higherkindness.rules_scala.common.worker.WorkerMain.main$(WorkerMain.scala:27)
at higherkindness.rules_scala.workers.zinc.compile.ZincRunner$.main(ZincRunner.scala:55)
at higherkindness.rules_scala.workers.zinc.compile.ZincRunner.main(ZincRunner.scala)
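Both dumps show the same pattern: a blocking loop around parseDelimitedFrom on the stream coming from the other process. For reference, here is a minimal sketch of the worker-side read loop, assuming the standard persistent-worker protocol protos that the stacks above go through; the class name and doWork are illustrative, not the rules_scala implementation.

import com.google.devtools.build.lib.worker.WorkerProtocol.WorkRequest;
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkResponse;

public final class SketchWorker {
  public static void main(String[] args) throws Exception {
    while (true) {
      // Blocks until a complete delimited message arrives; returns null on EOF.
      WorkRequest request = WorkRequest.parseDelimitedFrom(System.in);
      if (request == null) {
        return; // Bazel closed stdin; shut down cleanly.
      }
      WorkResponse.newBuilder()
          .setRequestId(request.getRequestId()) // multiplexed workers must echo the request id
          .setExitCode(doWork(request.getArgumentsList()))
          .build()
          .writeDelimitedTo(System.out);
      System.out.flush();
    }
  }

  // Illustrative stand-in for the real work (e.g. a Zinc compile).
  private static int doWork(java.util.List<String> args) {
    return 0;
  }
}

If either side stops short of writing its delimited message, or swallows an error before responding, the other side’s parseDelimitedFrom never returns, which is exactly the state captured in the two dumps above.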
Important files:
- The multiplexer is implemented in bazelbuild/bazel: WorkerMultiplexer.java
- The worker used by the symptomatic multiplexed worker test mentioned earlier is also implemented in bazelbuild/bazel: ExampleWorkerMultiplexer.java (note: this is a multiplexer-compatible worker, not a multiplexer)
- The worker we use (and which yielded the worker process thread dump snippet above) is implemented in a multiplexer-compatible branch of higherkindness/rules_scala: ZincRunner.scala, WorkerMain.scala
We figured out what was causing this issue for us: it was a problem with our multiplex worker implementation. The hang occurred whenever the worker encountered a fatal exception, in which case it would neither report the failure nor exit, leaving the Bazel-side multiplexer waiting forever for a response. More details here.
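To make that concrete, here is a hedged sketch of the shape of such a fix in Java against the worker protocol protos (illustrative only, not the actual change linked above): every request must end in a WorkResponse, even when the handler throws something fatal, and a worker that cannot respond should exit rather than leave the build hanging.

// Sketch only: always write a response, or the Bazel-side multiplexer
// blocks in parseDelimitedFrom indefinitely (see the thread dumps above).
static void handleRequest(WorkRequest request) {
  WorkResponse.Builder response =
      WorkResponse.newBuilder().setRequestId(request.getRequestId());
  try {
    response.setExitCode(doWork(request.getArgumentsList()));
  } catch (Throwable t) { // deliberately broad: includes fatal errors
    java.io.StringWriter trace = new java.io.StringWriter();
    t.printStackTrace(new java.io.PrintWriter(trace, true));
    response.setExitCode(1).setOutput(trace.toString());
  }
  try {
    synchronized (System.out) { // responses from concurrent requests must not interleave
      response.build().writeDelimitedTo(System.out);
      System.out.flush();
    }
  } catch (java.io.IOException e) {
    System.exit(1); // if we cannot even report the failure, exiting beats hanging
  }
}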
The change I linked seems to have resolved the issue entirely for us. It seems @tomdegoede encountered a different deadlock on Bazel’s end though, so I’ll leave this issue open for now.
Ah, I see. We aren’t using this feature yet, so we haven’t encountered this problem ourselves, but our tests do show it. I’ll keep this issue open for our team (or anyone else who has ideas!), but it may be some time before we have a chance to investigate.