Transform fails with S3 as backend storage
See original GitHub issueI’m running TFX in KubeFlow and now I’m trying to use an S3 backend, e.g. Minio.
ExampleGen, StatisticsGen and SchemaGen completes successfully but Transform fails. It seems to have finished all computations and has written the graph to a tmp dir in S3 but then it tries to copy it to the actual patch and fails.
In Minio I can see the output of the analyzer cache and the transform_tmp dir. transformed_examples is empty.
Below is the log output with the error:
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267047 nanos: 216625690 } message: "Assets added to graph." instruction_id: "bundle_170" transform_id: "Analyze/CreateSavedModel[tf_compat_v1]/BindTensors/ReplaceWithConstants" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/saved_model/builder_impl.py:666" thread: "Thread-12"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267047 nanos: 279367208 } message: "Assets written to: s3://pipelines/tfx/trace_pipeline_e2e/TransformMaster/transform_graph/1736/.temp_path/tftransform_tmp/3d458c0a7a00464f9632b2d467cdd963/assets" instruction_id: "bundle_170" transform_id: "Analyze/CreateSavedModel[tf_compat_v1]/BindTensors/ReplaceWithConstants" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/saved_model/builder_impl.py:775" thread: "Thread-12"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267047 nanos: 331916809 } message: "SavedModel written to: s3://pipelines/tfx/trace_pipeline_e2e/TransformMaster/transform_graph/1736/.temp_path/tftransform_tmp/3d458c0a7a00464f9632b2d467cdd963/saved_model.pb" instruction_id: "bundle_170" transform_id: "Analyze/CreateSavedModel[tf_compat_v1]/BindTensors/ReplaceWithConstants" log_location: "/usr/local/lib/python3.7/dist-packages/tensorflow/python/saved_model/builder_impl.py:426" thread: "Thread-12"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: ERROR timestamp { seconds: 1612267136 nanos: 111515760 } message: "Error processing instruction bundle_170. Original traceback is\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n return self.do_fn_invoker.invoke_process(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 769, in invoke_process\n windowed_value, additional_args, additional_kwargs)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 893, in _invoke_process_per_window\n self.process_method(*args_for_process),\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/core.py\", line 1590, in <lambda>\n wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]\n File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 37, in _copy_tree_to_unique_temp_dir\n _copy_tree(source, destination)\n File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 51, in _copy_tree\n os.path.join(source, filename), os.path.join(destination, filename))\n File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 53, in _copy_tree\n tf.io.gfile.copy(source, destination)\n File \"/usr/local/lib/python3.7/dist-packages/tensorflow/python/lib/io/file_io.py\", line 513, in copy_v2\n compat.as_bytes(src), compat.as_bytes(dst), overwrite)\ntensorflow.python.framework.errors_impl.AbortedError: All 10 retry attempts failed. The last failure: Unknown: : No response body.\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 289, in _execute\n response = task()\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 362, in <lambda>\n lambda: self.create_worker().do_instruction(request), request)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 607, in do_instruction\n getattr(request, request_type), request.instruction_id)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py\", line 644, in process_bundle\n bundle_processor.process_bundle(instruction_id))\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/bundle_processor.py\", line 1000, in process_bundle\n element.data)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/bundle_processor.py\", line 228, in process_encoded\n self.output(decoded_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 359, in output\n cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 221, in receive\n self.consumer.process(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n delayed_application = self.dofn_runner.process(o)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n self._reraise_augmented(exn)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n return self.do_fn_invoker.invoke_process(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 588, in invoke_process\n windowed_value, self.process_method(windowed_value.value))\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1401, in process_outputs\n self.main_receivers.receive(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 221, in receive\n self.consumer.process(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n delayed_application = self.dofn_runner.process(o)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n self._reraise_augmented(exn)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n return self.do_fn_invoker.invoke_process(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 588, in invoke_process\n windowed_value, self.process_method(windowed_value.value))\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1401, in process_outputs\n self.main_receivers.receive(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 221, in receive\n self.consumer.process(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n delayed_application = self.dofn_runner.process(o)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n self._reraise_augmented(exn)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n return self.do_fn_invoker.invoke_process(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 769, in invoke_process\n windowed_value, additional_args, additional_kwargs)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 894, in _invoke_process_per_window\n self.threadsafe_watermark_estimator)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1401, in process_outputs\n self.main_receivers.receive(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 158, in receive\n cython.cast(Operation, consumer).process(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/operations.py\", line 719, in process\n delayed_application = self.dofn_runner.process(o)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1241, in process\n self._reraise_augmented(exn)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1321, in _reraise_augmented\n raise_with_traceback(new_exn)\n File \"/usr/local/lib/python3.7/dist-packages/future/utils/__init__.py\", line 446, in raise_with_traceback\n raise exc.with_traceback(traceback)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 1239, in process\n return self.do_fn_invoker.invoke_process(windowed_value)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 769, in invoke_process\n windowed_value, additional_args, additional_kwargs)\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py\", line 893, in _invoke_process_per_window\n self.process_method(*args_for_process),\n File \"/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/core.py\", line 1590, in <lambda>\n wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]\n File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 37, in _copy_tree_to_unique_temp_dir\n _copy_tree(source, destination)\n File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 51, in _copy_tree\n os.path.join(source, filename), os.path.join(destination, filename))\n File \"/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py\", line 53, in _copy_tree\n tf.io.gfile.copy(source, destination)\n File \"/usr/local/lib/python3.7/dist-packages/tensorflow/python/lib/io/file_io.py\", line 513, in copy_v2\n compat.as_bytes(src), compat.as_bytes(dst), overwrite)\nRuntimeError: tensorflow.python.framework.errors_impl.AbortedError: All 10 retry attempts failed. The last failure: Unknown: : No response body. [while running \'WriteTransformFn/WriteTransformFnToTemp\']\n\n" instruction_id: "bundle_170" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:296" thread: "Thread-12"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 21243095 } message: "No more requests from control plane" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:266" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 21665096 } message: "SDK Harness waiting for in-flight requests to complete" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:267" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 21919965 } message: "Closing all cached grpc data channels." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/data_plane.py:721" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 22111654 } message: "Closing all cached gRPC state handlers." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:891" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 31976699 } message: "Done consuming work." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:279" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 32243013 } message: "Python sdk harness exiting." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker_main.py:166" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 851266622 } message: "No more requests from control plane" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:266" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 851534366 } message: "SDK Harness waiting for in-flight requests to complete" log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:267" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 851665258 } message: "Closing all cached grpc data channels." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/data_plane.py:721" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 851786136 } message: "Closing all cached gRPC state handlers." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:891" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 859477281 } message: "Done consuming work." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker.py:279" thread: "MainThread"
INFO:apache_beam.runners.portability.local_job_service:Worker: severity: INFO timestamp { seconds: 1612267137 nanos: 859626293 } message: "Python sdk harness exiting." log_location: "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/worker/sdk_worker_main.py:166" thread: "MainThread"
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py", line 1239, in process
return self.do_fn_invoker.invoke_process(windowed_value)
File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py", line 769, in invoke_process
windowed_value, additional_args, additional_kwargs)
File "/usr/local/lib/python3.7/dist-packages/apache_beam/runners/common.py", line 893, in _invoke_process_per_window
self.process_method(*args_for_process),
File "/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/core.py", line 1590, in <lambda>
wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
File "/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py", line 37, in _copy_tree_to_unique_temp_dir
_copy_tree(source, destination)
File "/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py", line 51, in _copy_tree
os.path.join(source, filename), os.path.join(destination, filename))
File "/usr/local/lib/python3.7/dist-packages/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py", line 53, in _copy_tree
tf.io.gfile.copy(source, destination)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 513, in copy_v2
compat.as_bytes(src), compat.as_bytes(dst), overwrite)
tensorflow.python.framework.errors_impl.AbortedError: All 10 retry attempts failed. The last failure: Unknown: : No response body.
Issue Analytics
- State:
- Created 3 years ago
- Comments:22 (14 by maintainers)
Top Results From Across the Web
Resolve errors uploading data to or downloading data ... - AWS
First, follow the steps below to run the SELECT INTO OUTFILE S3 or LOAD DATA FROM S3 commands using Amazon Aurora. If you...
Read more >Terraform Using AWS S3 Remote Backend - DataNext Solutions
There are many types of remote backends you can use with Terraform but in this post, we will cover the popular solution of...
Read more >How to manage Terraform state - Gruntwork Blog
Go to the Terraform code, remove the backend configuration, and rerun terraform init to copy the Terraform state back to your local disk....
Read more >Amazon S3 | Apache Flink
Amazon S3 # Amazon Simple Storage Service (Amazon S3) provides cloud object ... and writing data as well in conjunction with the streaming...
Read more >Copy and transform data in Amazon Simple Storage Service ...
Use the following steps to create an Amazon S3 linked service in the Azure portal UI. Browse to the Manage tab in your...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@arghyaganguly I’m using TFX 0.26.1 with it’s dependencies.
@Bobgy @PatrickXYS TFX now supports a recent enough beam version to be able to run on S3 and other components, e.g. ExampleGen, SchemaGen, StatisticsGen, all work as expected. The issue is solemnly with Transform, hence it is likely a TFX issue and not Kubeflow or KFP.
The actual call failing is
tf.io.gfile.copy(source, destination)
. I think someone from TFX should look at this.Closing this since this is not a tfx specific issue.Please raise an issue in Kubeflow. https://github.com/kubeflow/kubeflow/issues