InfraValidator fails in KubeFlow

System information

  • Have I specified the code to reproduce the issue: No, but standard TFX pipeline with InfraValidator in KubeFlow should reproduce this issue.
  • Environment in which the code is executed: On-prem KubeFlow cluster. KF v1.1.0. k8s 1.19
  • TensorFlow version (you are using): 2.4.0
  • TFX Version: 0.27.0
  • Python version: 3.7

Describe the current behavior

InfraValidator seems to load and query the model successfully but fails when cleaning up resources.

Describe the expected behavior

Resources are cleaned up successfully.

Standalone code to reproduce the issue

None provided, but a standard TFX pipeline with InfraValidator in KubeFlow should reproduce this issue.

Other info / logs

InfraValidator fails when deleting the TF Serving pod. The traceback shows pod creation failing with a 500 error from the Istio sidecar injector webhook; notably, the pod name that cleanup then tries to delete is None. Logs from the failure:

INFO:absl:Starting infra validation (attempt 1/5).
INFO:absl:Starting KubernetesRunner(image: docker.vby.svenskaspel.se:8181/tensorflow/serving:2.3.0, pod_name: None).
INFO:absl:Stopping KubernetesRunner(image: docker.vby.svenskaspel.se:8181/tensorflow/serving:2.3.0, pod_name: None).
INFO:absl:Deleting Pod (name=None)
WARNING:absl:Error occurred while deleting the Pod. Please run the following command to manually clean it up:
kubectl delete pod --namespace admin None
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tfx/components/infra_validator/executor.py", line 356, in _ValidateOnce
    runner.Start()
  File "/usr/local/lib/python3.7/dist-packages/tfx/components/infra_validator/model_server_runners/kubernetes_runner.py", line 140, in Start
    body=self._BuildPodManifest())
  File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/apis/core_v1_api.py", line 6115, in create_namespaced_pod
    (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/apis/core_v1_api.py", line 6206, in create_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/api_client.py", line 344, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/api_client.py", line 178, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/api_client.py", line 387, in request
    body=body)
  File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'f16c63f3-113e-4eba-b50b-5f56f81c7599', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 17 Feb 2021 13:30:17 GMT', 'Content-Length': '457'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"sidecar-injector.istio.io\": Post \"https://istio-sidecar-injector.istio-system.svc:443/inject?timeout=30s\": EOF","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"sidecar-injector.istio.io\": Post \"https://istio-sidecar-injector.istio-system.svc:443/inject?timeout=30s\": EOF"}]},"code":500}


During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tfx/components/infra_validator/model_server_runners/kubernetes_runner.py", line 178, in Stop
    self._DeleteModelServerPod()
  File "/usr/local/lib/python3.7/dist-packages/apache_beam/utils/retry.py", line 260, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tfx/components/infra_validator/model_server_runners/kubernetes_runner.py", line 195, in _DeleteModelServerPod
    namespace=self._namespace)
  File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/apis/core_v1_api.py", line 9788, in delete_namespaced_pod
    (data) = self.delete_namespaced_pod_with_http_info(name, namespace, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/apis/core_v1_api.py", line 9830, in delete_namespaced_pod_with_http_info
    raise ValueError("Missing the required parameter `name` when calling `delete_namespaced_pod`")
ValueError: Missing the required parameter `name` when calling `delete_namespaced_pod`
ERROR:absl:Infra validation (attempt 1/5) failed.

This repeats for each of the five attempts. Oddly, even though errors are thrown, the InfraValidator component itself does not fail.
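The second exception in the log comes from a delete call made with a pod name of None, after pod creation had already failed. One way to avoid that secondary error is to skip cleanup when no pod was ever created. The following is a minimal sketch of that guard, not TFX's actual code: FakePodApi, GuardedRunner, and validate_once are hypothetical names invented for illustration, and the fake client only mimics the failure pattern seen in the traceback.

```python
from typing import List, Optional, Tuple


class FakePodApi:
    """Hypothetical stand-in for a Kubernetes client (not the real API)."""

    def __init__(self, fail_on_create: bool) -> None:
        self.fail_on_create = fail_on_create
        self.deleted: List[Tuple[str, str]] = []

    def create_pod(self, namespace: str) -> str:
        # Mimics the 500 error returned when the admission webhook fails.
        if self.fail_on_create:
            raise RuntimeError("500: failed calling webhook")
        return "model-server-pod"

    def delete_pod(self, name: Optional[str], namespace: str) -> None:
        # Mimics the client rejecting a missing pod name.
        if name is None:
            raise ValueError("Missing the required parameter `name`")
        self.deleted.append((namespace, name))


class GuardedRunner:
    """Hypothetical runner that only deletes a pod it actually created."""

    def __init__(self, api: FakePodApi, namespace: str) -> None:
        self._api = api
        self._namespace = namespace
        self._pod_name: Optional[str] = None

    def start(self) -> None:
        # The name is only recorded if creation succeeds.
        self._pod_name = self._api.create_pod(self._namespace)

    def stop(self) -> bool:
        # Guard: nothing to clean up if creation never succeeded.
        if self._pod_name is None:
            return False
        self._api.delete_pod(self._pod_name, self._namespace)
        self._pod_name = None
        return True


def validate_once(api: FakePodApi, namespace: str = "admin") -> str:
    """Runs one attempt and always attempts cleanup afterwards."""
    runner = GuardedRunner(api, namespace)
    try:
        runner.start()
        return "passed"
    except RuntimeError as err:
        return f"failed: {err}"
    finally:
        runner.stop()
```

With this guard, a failed creation surfaces only the original error, and cleanup still runs normally when a pod exists. It does not address the underlying Istio webhook failure.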

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

chongkong commented, Apr 15, 2021 (1 reaction)

Thanks, will take a look. Let's continue the discussion on the PR.
