
[Bug]: Beam Pipeline does not halt execution while running on non local Spark Cluster


What happened?

I have deployed a Spark v3.1.2 cluster on Kubernetes. My Beam job server and Beam SDK container are running on two separate Linux virtual machines. The following pipeline keeps executing and never completes:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

op = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=vm2-hostname::50000",
    "--artifact_endpoint=localhost:8098",
])

with beam.Pipeline(options=op) as p:
    p | beam.Create([1, 2, 3, 10]) | beam.Map(lambda x: x + 1) | beam.Map(print)

The Docker logs for the SDK container show the following error:

Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:43541', '--artifact_endpoint=localhost:44461', '--provision_endpoint=localhost:43225', '--control_endpoint=localhost:35475']
E1101 22:11:16.787804592  22 fork_posix.cc:76]  Other threads are currently calling into gRPC, skipping fork() handlers
2022/11/01 22:13:16 Failed to obtain provisioning information: failed to dial server at localhost:43225
caused by: context deadline exceeded

Issue Priority

Priority: 3

Issue Component

Component: sdk-py-harness

Issue Analytics

  • State: open
  • Created a year ago
  • Comments:10 (7 by maintainers)

Top GitHub Comments

1 reaction
cozos commented, Nov 29, 2022

At least on the RDD-based Spark runner (i.e. not Dataset/Structured Streaming):

What is the purpose of SDK service?

All Beam pipelines are converted into a Java Spark RDD pipeline. If you write your DoFns in Python, the Java RDDs cannot execute that Python code, so the SDK Harness hosts your Python environment and Spark hands the Python logic over to it for execution.
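For illustration, here is a minimal sketch (the pipeline and names are hypothetical, not from the original issue) of a Python DoFn; when a job like this runs on the portable Spark runner, the process method is executed inside the Python SDK Harness rather than in the Spark JVM:

import apache_beam as beam

class AddOne(beam.DoFn):
    # This Python body cannot run inside the Spark JVM; on the portable
    # runner it is shipped to and executed by the Python SDK Harness.
    def process(self, element):
        yield element + 1

with beam.Pipeline() as p:
    p | beam.Create([1, 2, 3]) | beam.ParDo(AddOne()) | beam.Map(print)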

does it mean that each Spark worker node should have its own Beam SDK service?

Spark workers communicate with SDK Harnesses via the gRPC Fn API. It's best to deploy a harness on the same host as each Spark worker to minimize network I/O, since data has to be sent back and forth between the worker and the SDK Harness for processing. You can deploy it on the same node as a Docker container or as a process (see the --environment_type option), as sketched below. That said, --environment_type=EXTERNAL has its own advantages: the SDK Harness does not have to share resources (such as CPU and memory) with the Spark worker.
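As a rough illustration of how the environment is chosen (the image tag and host:port below are placeholders, not values from this issue), the portable runner's pipeline options might look like this:

from apache_beam.options.pipeline_options import PipelineOptions

# DOCKER: the runner starts an SDK Harness container on each worker host.
docker_opts = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=DOCKER",
    "--environment_config=apache/beam_python3.9_sdk:2.42.0",  # placeholder image tag
])

# EXTERNAL: the runner connects to an SDK Harness you start and manage yourself
# (for example, a long-running worker-pool container) on each worker host.
external_opts = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=worker-host:50000",  # placeholder host:port
])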

0 reactions
cozos commented, Dec 7, 2022

Where is the inference code executed? Is it executed in the SDK Harness service?

Yes, it is executed in the SDK Harness, which in your case is a Docker container.

If so, can that service use the underlying GPUs? Also, can I run any PyTorch or HuggingFace Transformers model using RunInference?

You need to make dependencies such as PyTorch available in the Docker container. The same goes for GPUs: you need to install CUDA drivers and so on, and you also need to make the GPUs accessible from Docker (I don't know the details offhand - probably covered here: https://docs.docker.com/config/containers/resource_constraints/#gpu). See the instructions on how to build a custom container here: https://beam.apache.org/documentation/runtime/environments/#custom-containers
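As a minimal sketch of what RunInference with a PyTorch model can look like (the model class, state-dict path, and parameters below are placeholders, not from this issue), assuming the custom container already ships torch and the saved weights are reachable from the harness:

import apache_beam as beam
import torch
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.pytorch_inference import PytorchModelHandlerTensor

class TinyModel(torch.nn.Module):
    # Placeholder model; in practice this would be your PyTorch/Transformers model.
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# The handler loads the saved state dict inside the SDK Harness at run time.
model_handler = PytorchModelHandlerTensor(
    state_dict_path="/models/tiny_model.pt",  # placeholder path
    model_class=TinyModel,
    model_params={},
)

with beam.Pipeline() as p:
    (p
     | beam.Create([torch.tensor([1.0]), torch.tensor([2.0])])
     | RunInference(model_handler)
     | beam.Map(print))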

Seems like converting data and sending it back and forth between the Spark worker and the SDK service may involve a lot of overhead

Sending data back and forth through the Fn API involves serialization/deserialization and moving your data through the transport/network layer, so yes, there is overhead.

Will a native Spark job be faster than a Beam job? Is there a performance hit when we write jobs in Beam instead of native Spark?

You'd have to benchmark your pipeline, but my guess would be that native Spark is faster. In addition to the Fn API overhead, if you can use the higher-level Spark APIs (Spark SQL, DataFrame/Dataset), Spark can apply additional optimizations (vectorization, code generation) to your transforms, whereas Beam transforms/DoFns are a black box to Spark and cannot be optimized that way.
