[Bug]: Beam Pipeline does not halt execution while running on non-local Spark cluster
What happened?
I have deployed a Spark v3.1.2 cluster on Kubernetes. My Beam job server and Beam SDK container are running on two separate Linux virtual machines. The following pipeline keeps executing and never stops:
```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

op = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=vm2-hostname::50000",
    "--artifact_endpoint=localhost:8098",
])

with beam.Pipeline(options=op) as p:
    p | beam.Create([1, 2, 3, 10]) | beam.Map(lambda x: x + 1) | beam.Map(print)
```
The Docker logs for the SDK container show the following error:
```
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:43541', '--artifact_endpoint=localhost:44461', '--provision_endpoint=localhost:43225', '--control_endpoint=localhost:35475']
E1101 22:11:16.787804592 22 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
2022/11/01 22:13:16 Failed to obtain provisioning information: failed to dial server at localhost:43225
caused by: context deadline exceeded
```
Issue Priority
Priority: 3
Issue Component
Component: sdk-py-harness
Issue Analytics
- State:
- Created a year ago
- Comments: 10 (7 by maintainers)
At least on the RDD-based Spark runner (i.e. not Dataset/Structured Streaming):
All Beam pipelines are converted into a Java Spark RDD pipeline. If you write your DoFns in Python, Java RDDs cannot execute that Python code, so the SDK Harness contains your Python environment and Spark executes your Python logic there.
Spark workers communicate with SDK Harnesses via the gRPC Fn API. It is better to deploy the SDK Harness on the same host as the Spark worker to minimize network IO, since data has to be sent back and forth between the worker and the SDK Harness for processing. You can deploy it on the same node as a Docker container or as a process (see the `--environment_type` option; a sketch of how these options are passed follows below). However, `--environment_type EXTERNAL` has its own advantages, as the SDK Harness does not have to share resources (such as CPU and memory) with the Spark worker.

Yes, it is executed in the SDK Harness, which in your case is a Docker container.
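For illustration only, here is a rough sketch (endpoints and the PROCESS boot command are assumptions based on the Beam docs, not taken from the issue) of how the DOCKER and PROCESS environment types are typically passed to the portable runner:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# SDK Harness runs as a Docker container that the runner starts on each worker node.
docker_opts = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=DOCKER",
])

# SDK Harness runs as a plain process on the worker host; environment_config is a
# JSON object whose "command" points at the Beam boot entrypoint on that host.
process_opts = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=PROCESS",
    '--environment_config={"command": "/opt/apache/beam/boot"}',
])
```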
You need to make dependencies such as `pytorch` available in the Docker container. The same goes for GPUs: you need to install CUDA drivers and related libraries, and you also need to make the GPUs accessible from Docker (I don't know the details offhand; probably here: https://docs.docker.com/config/containers/resource_constraints/#gpu). See the instructions on how to build a custom container here: https://beam.apache.org/documentation/runtime/environments/#custom-containers

Sending data back and forth through the Fn API involves serialization/deserialization and moving your data through the transport/network layer, so yes, there is overhead.
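As a minimal, hypothetical sketch (the registry and image name below are made up), a custom SDK container built per those instructions, with pytorch and the CUDA libraries baked in, would then be referenced from the pipeline roughly like this:

```python
from apache_beam.options.pipeline_options import PipelineOptions

opts = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=DOCKER",
    # Hypothetical custom image built on top of the released Beam Python SDK image,
    # with `pip install torch` and the CUDA libraries added during the build.
    "--environment_config=myregistry.example.com/beam_python_sdk_torch:latest",
])
```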
You'd have to benchmark your pipeline, but my guess would be that native Spark is faster. In addition to the Fn API overhead, if you can use the higher-level Spark APIs (Spark SQL, DataFrame/Dataset), Spark can apply additional optimizations (vectorization, codegen) to your transforms, whereas Beam transforms/DoFns are a black box to Spark and cannot be optimized.
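To illustrate the contrast, here is a rough native PySpark equivalent of the toy pipeline from the issue (a sketch, not a benchmark). Because the transform is expressed through the DataFrame API, Spark's optimizer can see and optimize it, unlike an opaque Python DoFn:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("native-equivalent").getOrCreate()

# Equivalent of beam.Create([1, 2, 3, 10]) | beam.Map(lambda x: x + 1)
df = spark.createDataFrame([(1,), (2,), (3,), (10,)], ["x"])
df.select((F.col("x") + 1).alias("y")).show()
```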