InfraValidator fails in KubeFlow
System information
- Have I specified the code to reproduce the issue: No, but a standard TFX pipeline with InfraValidator in KubeFlow should reproduce this issue.
- Environment in which the code is executed: On-prem KubeFlow cluster. KF v1.1.0. k8s 1.19
- TensorFlow version (you are using): 2.4.0
- TFX Version: 0.27.0
- Python version: 3.7
Describe the current behavior: InfraValidator seems to load and query the model successfully, but fails when cleaning up resources.
Describe the expected behavior: Resources are cleaned up successfully.
Standalone code to reproduce the issue: None, but a standard TFX pipeline with InfraValidator in KubeFlow should reproduce this issue.
Other info / logs:
InfraValidator fails when deleting the TF Serving pod. The first error comes from Istio, but the noteworthy part is that the name of the pod it then tries to delete is None. Logs from the failure:
INFO:absl:Starting infra validation (attempt 1/5).
INFO:absl:Starting KubernetesRunner(image: docker.vby.svenskaspel.se:8181/tensorflow/serving:2.3.0, pod_name: None).
INFO:absl:Stopping KubernetesRunner(image: docker.vby.svenskaspel.se:8181/tensorflow/serving:2.3.0, pod_name: None).
INFO:absl:Deleting Pod (name=None)
WARNING:absl:Error occurred while deleting the Pod. Please run the following command to manually clean it up:
kubectl delete pod --namespace admin None
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/tfx/components/infra_validator/executor.py", line 356, in _ValidateOnce
runner.Start()
File "/usr/local/lib/python3.7/dist-packages/tfx/components/infra_validator/model_server_runners/kubernetes_runner.py", line 140, in Start
body=self._BuildPodManifest())
File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/apis/core_v1_api.py", line 6115, in create_namespaced_pod
(data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/apis/core_v1_api.py", line 6206, in create_namespaced_pod_with_http_info
collection_formats=collection_formats)
File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/api_client.py", line 344, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/api_client.py", line 178, in __call_api
_request_timeout=_request_timeout)
File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/api_client.py", line 387, in request
body=body)
File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/rest.py", line 266, in POST
body=body)
File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/rest.py", line 222, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'f16c63f3-113e-4eba-b50b-5f56f81c7599', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 17 Feb 2021 13:30:17 GMT', 'Content-Length': '457'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"sidecar-injector.istio.io\": Post \"https://istio-sidecar-injector.istio-system.svc:443/inject?timeout=30s\": EOF","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"sidecar-injector.istio.io\": Post \"https://istio-sidecar-injector.istio-system.svc:443/inject?timeout=30s\": EOF"}]},"code":500}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/tfx/components/infra_validator/model_server_runners/kubernetes_runner.py", line 178, in Stop
self._DeleteModelServerPod()
File "/usr/local/lib/python3.7/dist-packages/apache_beam/utils/retry.py", line 260, in wrapper
return fun(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tfx/components/infra_validator/model_server_runners/kubernetes_runner.py", line 195, in _DeleteModelServerPod
namespace=self._namespace)
File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/apis/core_v1_api.py", line 9788, in delete_namespaced_pod
(data) = self.delete_namespaced_pod_with_http_info(name, namespace, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/kubernetes/client/apis/core_v1_api.py", line 9830, in delete_namespaced_pod_with_http_info
raise ValueError("Missing the required parameter `name` when calling `delete_namespaced_pod`")
ValueError: Missing the required parameter `name` when calling `delete_namespaced_pod`
ERROR:absl:Infra validation (attempt 1/5) failed.
This then repeats five times. Oddly, even though these errors are thrown, the InfraValidator component itself does not fail.
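Reading the two tracebacks together, pod creation fails inside Start (the Istio webhook returns a 500), so no pod name is ever recorded, and Stop then calls the delete API with name=None. A minimal sketch of a guard that would avoid the second exception is shown below. It is not the TFX implementation or the fix from the PR: `FakeCoreV1Api` and `GuardedRunner` are hypothetical stand-ins written for illustration, and only the method names `create_namespaced_pod` and `delete_namespaced_pod` and the `ValueError` message are taken from the logs above.

```python
from typing import List, Optional, Tuple


class FakeCoreV1Api:
    """Hypothetical stand-in for the Kubernetes client (illustration only)."""

    def __init__(self, fail_on_create: bool = False) -> None:
        self.fail_on_create = fail_on_create
        self.deleted: List[Tuple[str, str]] = []

    def create_namespaced_pod(self, namespace: str, body: dict) -> str:
        if self.fail_on_create:
            # Mimics the 500 seen when the sidecar-injector webhook call fails.
            raise RuntimeError("(500) Internal Server Error")
        return body["metadata"]["name"]

    def delete_namespaced_pod(self, name: Optional[str], namespace: str) -> None:
        if name is None:
            # Same message as the ValueError in the traceback above.
            raise ValueError(
                "Missing the required parameter `name` when calling "
                "`delete_namespaced_pod`")
        self.deleted.append((namespace, name))


class GuardedRunner:
    """Hypothetical runner that only deletes a pod it actually created."""

    def __init__(self, api: FakeCoreV1Api, namespace: str) -> None:
        self._api = api
        self._namespace = namespace
        self._pod_name: Optional[str] = None

    def start(self) -> None:
        # The pod name is recorded only after creation succeeds.
        self._pod_name = self._api.create_namespaced_pod(
            namespace=self._namespace,
            body={"metadata": {"name": "model-server"}})

    def stop(self) -> None:
        if self._pod_name is None:
            # Nothing was created, so there is nothing to clean up.
            return
        self._api.delete_namespaced_pod(
            name=self._pod_name, namespace=self._namespace)
        self._pod_name = None
```

With this guard, a failed start followed by stop leaves the original creation error as the only reported failure, instead of masking it with a second exception and a misleading `kubectl delete pod --namespace admin None` hint.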
Issue Analytics
- Created 3 years ago
- Comments:11 (10 by maintainers)
Top GitHub Comments
Thanks will take a look. Let’s continue discussion from the PR.