
[Pusher][AI Platform Training & Prediction] Unexpected model deploy status

See original GitHub issue

While going through your tutorial, I encountered the following error in the Kubeflow logs console:

Traceback (most recent call last):
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 371, in <module>
    main()
  File "/tfx-src/tfx/orchestration/kubeflow/container_entrypoint.py", line 364, in main
    execution_info = launcher.launch()
  File "/tfx-src/tfx/orchestration/launcher/base_component_launcher.py", line 205, in launch
    execution_decision.exec_properties)
  File "/tfx-src/tfx/orchestration/launcher/in_process_component_launcher.py", line 67, in _run_executor
    executor.Do(input_dict, output_dict, exec_properties)
  File "/tfx-src/tfx/extensions/google_cloud_ai_platform/pusher/executor.py", line 91, in Do
    executor_class_path,
  File "/tfx-src/tfx/extensions/google_cloud_ai_platform/runner.py", line 253, in deploy_model_for_aip_prediction
    name='{}/versions/{}'.format(model_name, deploy_status['response']
KeyError: 'response'

The AI Platform model version shows an error (without any further explanation):

Create Version failed. Bad model detected with error: "Failed to load model: a bytes-like object is required, not 'str' (Error code: 0)"

I have no way to debug this further from the messages above. Is your tutorial incomplete?
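
For context on the KeyError: projects.models.versions.create returns a long-running operation that, once done, carries either a response field (on success) or an error field (on failure); the TFX runner at this version indexes deploy_status['response'] directly, so a failed deployment surfaces as KeyError: 'response' instead of the real error. Below is a minimal sketch of checking the operation result yourself, assuming the google-api-python-client package; the project, model, version and GCS path are placeholders.

import time

from googleapiclient import discovery

# Placeholder names -- substitute your own project, model, version and model path.
parent = 'projects/my-gcp-project/models/my_model'
body = {
    'name': 'my_version',
    'deploymentUri': 'gs://my-bucket/serving_model_dir/export/chicago-taxi/1585016069',
    'runtimeVersion': '1.15',
    'pythonVersion': '3.7',
}

api = discovery.build('ml', 'v1')
operation = api.projects().models().versions().create(
    parent=parent, body=body).execute()

# Poll the long-running operation until it finishes.
while True:
    deploy_status = api.projects().operations().get(
        name=operation['name']).execute()
    if deploy_status.get('done'):
        break
    time.sleep(10)

# A failed deployment carries 'error' instead of 'response', which is
# exactly what deploy_status['response'] in runner.py trips over.
if 'error' in deploy_status:
    print('Deployment failed:', deploy_status['error'])
else:
    print('Deployed:', deploy_status['response'])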

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

4 reactions
jiyongjung0 commented, Mar 27, 2020

It turns out that even though we used TF 1.15 while running TFX, the base container image used for training had TF 2, so the exported model contained some TF 2 ops. Full error message:

tensorflow_serving/util/retrier.cc:37] Loading servable: {name: default version: 1} failed: Not found: Op type not registered 'ParseExampleV2' in binary running on localhost. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

So there are two possible workarounds:

  • Force the trainer to use TF 1 (see below).
  • Manually re-upload the model with the TF 2.1 runtime in the Google Cloud Console, as explained in the comment above.
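
To check which case you are in, you can inspect the exported SavedModel for TF 2 ops such as ParseExampleV2 before pushing. A minimal sketch, assuming a local copy of the export directory (the path below is a placeholder):

from tensorflow.core.protobuf import saved_model_pb2

# Placeholder path -- point this at your Trainer's serving_model_dir export.
saved_model_path = 'serving_model_dir/export/chicago-taxi/1585016069/saved_model.pb'

saved_model = saved_model_pb2.SavedModel()
with open(saved_model_path, 'rb') as f:
    saved_model.ParseFromString(f.read())

# Collect every op type used in the graph and in its function library.
ops = set()
for meta_graph in saved_model.meta_graphs:
    ops.update(node.op for node in meta_graph.graph_def.node)
    for fn in meta_graph.graph_def.library.function:
        ops.update(node.op for node in fn.node_def)

# Per the serving error above, the TF 1.15 prediction runtime does not
# register ParseExampleV2, so its presence means you need the TF 1
# training workaround or the 2.1 runtime.
print('ParseExampleV2' in ops)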

Using TF1 for training

We use a custom container image for training. This is the image specified by the build_target_image flag, and it contains all of the model code. The image is built automatically when you create or update the pipeline. The Dockerfile for this image is generated in the working directory when you first create the pipeline, and you can safely modify it: the tfx CLI doesn’t touch the file if it already exists.

Open the Dockerfile and add RUN pip install tensorflow==1.15 as shown below:

FROM tensorflow/tfx:0.21.2
WORKDIR /pipeline
COPY ./ ./
ENV PYTHONPATH="/pipeline:${PYTHONPATH}"

RUN pip install tensorflow==1.15  # <-- add this line

Then update the pipeline and run it again. If the pipeline has already run training, Trainer will try to reuse the cached result rather than train again, so you may need to set enable_cache of the Pipeline object to False in pipeline.py:173, as in the sketch below.
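
A minimal sketch of the relevant part of the pipeline definition, assuming it uses tfx.orchestration.pipeline.Pipeline as in the TFX templates (the function and argument names are placeholders):

from tfx.orchestration import pipeline

def create_pipeline(pipeline_name, pipeline_root, components,
                    metadata_connection_config=None):
  return pipeline.Pipeline(
      pipeline_name=pipeline_name,
      pipeline_root=pipeline_root,
      components=components,
      # Disable caching so Trainer actually re-runs after switching the
      # training image to TF 1.15.
      enable_cache=False,
      metadata_connection_config=metadata_connection_config,
  )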

Debugging a model deployment

FYI, you can debug your deployment using stream (console) logging in the AI Platform Prediction service. It is not supported in TFX yet, so you have to deploy the model yourself with gcloud (or the REST API) from a local terminal. For example:

$ gcloud beta ai-platform models create jj --regions us-central1 --enable-console-logging
Created ml engine model [projects/jiyongjung-test/models/jj].
$ gcloud ai-platform versions create jj_v --model jj --runtime-version=1.15 --python-version=3.7 --framework 'TENSORFLOW' --origin gs://jiyongjung-test/tfx_pipeline_output/a20200324a3/Trainer/model/249/serving_model_dir/export/chicago-taxi/1585016069
Creating version (this might take a few minutes)......failed.
ERROR: (gcloud.ai-platform.versions.create) Create Version failed. Bad model detected with error:  "Failed to load model: a bytes-like object is required, not 'str' (Error code: 0)"

See also the documentation in Google Cloud. You can then see the detailed log in the Google Cloud Console (screenshot omitted).
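
If you prefer to read those console logs from a terminal instead of the Cloud Console, something like the following may work, assuming the google-cloud-logging client library; the cloudml_model_version resource filter below is an assumption, so verify it against the entries you actually see in the Logs Explorer (the model and version names come from the gcloud commands above):

from google.cloud import logging

client = logging.Client()

# Assumed filter for AI Platform Prediction console logs; adjust the
# resource type and labels to match your project's log entries.
log_filter = (
    'resource.type="cloudml_model_version" '
    'resource.labels.model_id="jj" '
    'resource.labels.version_id="jj_v"'
)

for entry in client.list_entries(filter_=log_filter, page_size=50):
    print(entry.timestamp, entry.payload)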

This problem will be fixed once we can use the 2.1 CAIP Prediction runtime in the next version of TFX.

Thank you for raising this issue.

1 reaction
michal-demecki-profil-software commented, Mar 23, 2020

  • Python: 3.7.6
  • TF: 2.1.0
  • TFX: 0.21.0
  • KFP: 0.2.5
  • Region: us-central1


