Transform component, running beam with Flink Runner. Flink job fails with ModuleNotFoundError.
System information
- Have I specified the code to reproduce the issue (Yes, No): No
- Environment in which the code is executed (e.g., Local(Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc): Linux
- TensorFlow version: 2.5.3
- TFX Version: 1.2.0
- Python version: 3.7
- Python dependencies (from pip freeze output):
Docker image for the SDK workers, built from:
FROM apache/beam_python3.7_sdk:2.39.0 tfx==1.2.0 tensorflow==2.5.3
Describe the current behavior
I am running a TFX pipeline with the FlinkRunner for Beam. The Transform component manages to start the Flink job, but the job fails immediately with ModuleNotFoundError: No module named 'pipeline_functions'.
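One way to narrow this kind of error down is to check whether the module is visible to the Python interpreter inside the SDK worker image itself. The snippet below is only a diagnostic sketch using the standard library; the helper name describe_module is made up for illustration, and running it inside the worker container is an assumption about the setup.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether `name` can be imported by the current interpreter.

    Hypothetical helper for debugging; it does not import the module,
    it only asks the import system whether it can be located.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised for dotted names with a missing parent, or odd module state.
        spec = None
    return {
        "module": name,
        "importable": spec is not None,
        "origin": getattr(spec, "origin", None),
        "python": sys.executable,
    }


if __name__ == "__main__":
    # 'pipeline_functions' is the module named in the error message.
    print(describe_module("pipeline_functions"))
```

If the module is reported as not importable in the worker environment, the user code was never installed there, which points at the packaging or staging step rather than at the Transform logic.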
Where can I find the code responsible for adding user code as --extra_package, or for actually installing the wheel that contains my user code? I was not able to locate it in the repository…
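For context, this is roughly what passing a user-code wheel by hand through the Beam pipeline options could look like. It is a sketch under assumptions, not a description of what TFX does internally: the helper name flink_beam_args, the endpoints, the wheel path, and the choice of the EXTERNAL environment type are all placeholders. Only the option names (--runner, --flink_master, --environment_type, --environment_config, --extra_package) are standard Beam Python options.

```python
def flink_beam_args(flink_master, worker_endpoint, user_wheel):
    """Build Beam pipeline args for a Flink job that stages a user wheel.

    All three arguments are placeholders chosen for illustration:
    - flink_master: host:port of the Flink REST endpoint
    - worker_endpoint: address of an external SDK worker pool (assumed setup)
    - user_wheel: local path to a wheel or sdist containing the user code
    """
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        # Assumption: workers run as an external worker pool.
        "--environment_type=EXTERNAL",
        f"--environment_config={worker_endpoint}",
        # Beam stages this package and installs it on the workers.
        f"--extra_package={user_wheel}",
    ]


if __name__ == "__main__":
    for arg in flink_beam_args(
        "localhost:8081", "localhost:50000", "dist/pipeline_functions-0.1-py3-none-any.whl"
    ):
        print(arg)
```

A list like this would typically be handed to the TFX pipeline through its beam_pipeline_args argument. Whether Transform adds its own --extra_package on top of this is exactly the open question here.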
I also wonder whether this comment is still current, given my versions of TFX and Beam.
Should I disable the wheel packaging and set force_tf_compat_v1=True in the Transform component? If so, what are the implications?
More generally, I was hoping to start a discussion with other TFX users who might have a similar setup. We are moving from running TFX pipelines with GCS for storage and Dataflow runners for Beam to running them with HDFS for storage and Flink runners for Beam. The journey is tortuous, and I was hoping to find other users who might want to share their experience.
Thank you!
Issue Analytics
- Created a year ago
- Comments: 11 (6 by maintainers)
Top GitHub Comments
@jccarles Solution 1 has the upside (if it works) that you only need to deploy one Flink cluster, which can then be used for all jobs. That is the better option if it works, but as I mentioned, I had trouble with it.
Not that I recall, actually. The issues I have had were usually about one worker crashing due to some problem (usually the data, I believe), after which the other workers (processes) eventually fail with a gRPC error.
Other issues I have had improved after updating Beam and Flink to the highest available versions.
I’m no expert, but I can have a look if you post your Flink charts.
@jccarles Actually, I created TFX Python custom components. They are very similar to Kubeflow components: https://www.tensorflow.org/tfx/guide/custom_function_component
To make this work, I created a second pipeline that spawned the Flink resources, started the TFX pipeline, waited for it to finish or crash, and then tore down Flink. In the event of a failure, the resources were kept for a longer period, say 24h.
Today I believe this could be implemented with a custom TFX component that spins up Flink: you use .add_downstream_node (similar to Kubeflow's .after) to order it before the rest of the pipeline, and then add an exit handler that takes care of the graceful Flink shutdown.
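The spawn / run / tear-down flow described above could be sketched in plain Python like this. It is only an illustration of the control flow: run_with_flink and the three callables (start_flink, run_pipeline, stop_flink) are hypothetical placeholders for whatever actually provisions the cluster and launches the TFX pipeline, and the delay_s parameter is an assumed way of expressing "keep the resources around for a while".

```python
def run_with_flink(start_flink, run_pipeline, stop_flink, keep_on_failure_s=24 * 3600):
    """Start Flink, run the pipeline, then tear Flink down.

    Hypothetical callables supplied by the caller:
    - start_flink() -> cluster handle
    - run_pipeline(cluster) -> result; raises on failure
    - stop_flink(cluster, delay_s) -> schedules teardown after delay_s seconds
    """
    cluster = start_flink()
    try:
        result = run_pipeline(cluster)
    except Exception:
        # On failure, keep the cluster around (e.g. 24h) for debugging.
        stop_flink(cluster, delay_s=keep_on_failure_s)
        raise
    # On success, tear down immediately.
    stop_flink(cluster, delay_s=0)
    return result
```

In an orchestrator, the teardown branch would map to the exit handler and the ordering to the upstream/downstream dependency between the Flink component and the rest of the pipeline.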
No worries! I had so much pain with this, so if I can spare someone a portion of it, that is awesome!
Hope this helps and feel free to let me know if you have any other questions!