Issue with model downloading from Google Storage when model is in subdirectory
See original GitHub issue
/kind bug
I deployed Kubeflow on GCP and also uploaded my model to Google Cloud Storage in the same GCP project.
I prepared a sample deployment based on https://github.com/kubeflow/kfserving/blob/master/docs/samples/tensorflow/tensorflow.yaml. When I use storageUri: "gs://kfserving-samples/models/tensorflow/flowers" and change the port, the deployment works correctly.
However, when I point it at my own Google Cloud Storage bucket, the kfserving/storage-initializer docker container fails with:

FileExistsError: [Errno 17] File exists: './1'

The model cannot be read, and the whole deployment ends up in the RevisionMissing state.

Google Cloud Storage model path: gs://<MY-BUCKET>/models/resnet50-tf-fp32/1/saved_model.pb

deployment.yaml:
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "resnet50-tf-fp32"
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "gs://<MY-BUCKET>/models/resnet50-tf-fp32"
        ports:
          - name: h2c
            containerPort: 8500
I debugged the kfserving/storage-initializer docker container. bucket.list_blobs from https://github.com/kubeflow/kfserving/blob/master/python/kfserving/kfserving/storage.py#L98 returns the following for my bucket:

[<Blob: <MY-BUCKET>, models/resnet50-tf-fp32/>,
<Blob: <MY-BUCKET>, models/resnet50-tf-fp32/1/>,
<Blob: <MY-BUCKET>, models/resnet50-tf-fp32/1/saved_model.pb>]

While processing models/resnet50-tf-fp32/1/, this line: https://github.com/kubeflow/kfserving/blob/master/python/kfserving/kfserving/storage.py#L112 creates a local file named 1 with the content "placeholder", which is the culprit. Then, while processing models/resnet50-tf-fp32/1/saved_model.pb, the application tries to create a directory named 1 and fails, because a file with that name was created in the previous step. Exact line: https://github.com/kubeflow/kfserving/blob/master/python/kfserving/kfserving/storage.py#L108
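To make the failure mode concrete, here is a minimal reconstruction of that sequence (my own sketch of the logic around the referenced lines, not the actual storage.py code):

import os
import tempfile

temp_dir = tempfile.mkdtemp()

# The placeholder blob "models/resnet50-tf-fp32/1/" maps to the local key
# "1" and is downloaded like any other object, leaving a regular file "1"
# with the content "placeholder" (the #L112 step above).
with open(os.path.join(temp_dir, "1"), "w") as f:
    f.write("placeholder")

# The blob "models/resnet50-tf-fp32/1/saved_model.pb" maps to
# "1/saved_model.pb", so "1" is now needed as a directory (the #L108 step).
# exist_ok=True does not help here: the existing "1" is a file, not a directory.
os.makedirs(os.path.join(temp_dir, "1"), exist_ok=True)
# -> FileExistsError: [Errno 17] File exists: '.../1'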
I compared this with the working example gs://kfserving-samples/models/tensorflow/flowers, for which bucket.list_blobs returns:

[<Blob: kfserving-samples, models/tensorflow/flowers/0001/saved_model.pb, 1557514652201269>,
<Blob: kfserving-samples, models/tensorflow/flowers/0001/variables/variables.data-00000-of-00001, 1557514652575363>,
<Blob: kfserving-samples, models/tensorflow/flowers/0001/variables/variables.index, 1557514652853907>]

This works because no placeholder object for the directory models/tensorflow/flowers/0001 is returned and processed by the storage-initializer.
kubectl get all
NAME READY STATUS RESTARTS AGE
pod/resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8bmh8rn 0/3 Init:Error 1 18s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/resnet50-tf-fp32-predictor-default-5x7qv ClusterIP 10.11.244.141 <none> 80/TCP 19s
service/resnet50-tf-fp32-predictor-default-5x7qv-private ClusterIP 10.11.253.131 <none> 80/TCP,9090/TCP,9091/TCP,8022/TCP 19s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/resnet50-tf-fp32-predictor-default-5x7qv-deployment 0/1 1 0 19s
NAME DESIRED CURRENT READY AGE
replicaset.apps/resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8b97cc 1 1 0 19s
NAME URL READY REASON
route.serving.knative.dev/resnet50-tf-fp32-predictor-default http://resnet50-tf-fp32-predictor-default.kubeflow-tomasz-sadowski.example.com Unknown RevisionMissing
NAME CONFIG NAME K8S SERVICE NAME GENERATION READY REASON
revision.serving.knative.dev/resnet50-tf-fp32-predictor-default-5x7qv resnet50-tf-fp32-predictor-default resnet50-tf-fp32-predictor-default-5x7qv 1 Unknown Deploying
NAME LATESTCREATED LATESTREADY READY REASON
configuration.serving.knative.dev/resnet50-tf-fp32-predictor-default resnet50-tf-fp32-predictor-default-5x7qv Unknown
NAME URL LATESTCREATED LATESTREADY READY REASON
service.serving.knative.dev/resnet50-tf-fp32-predictor-default http://resnet50-tf-fp32-predictor-default.kubeflow-tomasz-sadowski.example.com resnet50-tf-fp32-predictor-default-5x7qv Unknown RevisionMissing
kubectl describe pod/....
Name: resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx
Namespace: kubeflow-tomasz-sadowski
Priority: 0
Node: gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg/10.128.15.219
Start Time: Fri, 08 May 2020 18:36:50 +0200
Labels: app=resnet50-tf-fp32-predictor-default-5x7qv
pod-template-hash=66cd8b97cc
serving.knative.dev/configuration=resnet50-tf-fp32-predictor-default
serving.knative.dev/configurationGeneration=1
serving.knative.dev/revision=resnet50-tf-fp32-predictor-default-5x7qv
serving.knative.dev/revisionUID=9af2da45-9149-11ea-923d-42010a800034
serving.knative.dev/service=resnet50-tf-fp32-predictor-default
serving.kubeflow.org/inferenceservice=resnet50-tf-fp32
Annotations: autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/target: 1
internal.serving.kubeflow.org/storage-initializer-sourceuri: gs://ai-inferencing/models/resnet50-tf-fp32
queue.sidecar.serving.knative.dev/resourcePercentage: 0.2
serving.knative.dev/creator: system:serviceaccount:kubeflow:default
sidecar.istio.io/status:
{"version":"5f3ae3613c7945ef767cb9fd594596bc001ff3ab915f12e4379c0cb5648d2729","initContainers":["istio-init"],"containers":["istio-proxy"]...
traffic.sidecar.istio.io/includeOutboundIPRanges: *
Status: Pending
IP: 10.8.1.91
IPs: <none>
Controlled By: ReplicaSet/resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8b97cc
Init Containers:
storage-initializer:
Container ID: docker://5e9e492a9e899b4d29ac128c2b5815353372f8892f710ddef18bed20b2ff9107
Image: gcr.io/kfserving/storage-initializer:0.2.2
Image ID: docker-pullable://gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
Port: <none>
Host Port: <none>
Args:
gs://ai-inferencing/models/resnet50-tf-fp32
/mnt/models
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 08 May 2020 18:37:38 +0200
Finished: Fri, 08 May 2020 18:37:39 +0200
Ready: False
Restart Count: 3
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 100m
memory: 100Mi
Environment: <none>
Mounts:
/mnt/models from kfserving-provision-location (rw)
istio-init:
Container ID:
Image: docker.io/istio/proxy_init:1.1.6
Image ID:
Port: <none>
Host Port: <none>
Args:
-p
15001
-u
1337
-m
REDIRECT
-i
*
-x
-b
8080,8022,9090,9091,8012
-d
15020
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 100m
memory: 50Mi
Requests:
cpu: 10m
memory: 10Mi
Environment: <none>
Mounts: <none>
Containers:
kfserving-container:
Container ID:
Image: index.docker.io/tensorflow/serving@sha256:f7e59a29cbc17a6b507751cddde37bccad4407c05ebf2c13b8e6ccb7d2e9affb
Image ID:
Port: 8080/TCP
Host Port: 0/TCP
Command:
/usr/bin/tensorflow_model_server
Args:
--port=9000
--rest_api_port=8080
--model_name=resnet50-tf-fp32
--model_base_path=/mnt/models
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 2Gi
Requests:
cpu: 1
memory: 2Gi
Environment:
PORT: 8080
K_REVISION: resnet50-tf-fp32-predictor-default-5x7qv
K_CONFIGURATION: resnet50-tf-fp32-predictor-default
K_SERVICE: resnet50-tf-fp32-predictor-default
Mounts:
/mnt/models from kfserving-provision-location (ro)
/var/log from knative-var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xpccj (ro)
queue-proxy:
Container ID:
Image: gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:792f6945c7bc73a49a470a5b955c39c8bd174705743abf5fb71aa0f4c04128eb
Image ID:
Ports: 8022/TCP, 9090/TCP, 9091/TCP, 8012/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 40m
memory: 200Mi
Requests:
cpu: 25m
memory: 50Mi
Readiness: exec [/ko-app/queue -probe-period 0] delay=0s timeout=10s period=1s #success=1 #failure=3
Environment:
SERVING_NAMESPACE: kubeflow-tomasz-sadowski
SERVING_SERVICE: resnet50-tf-fp32-predictor-default
SERVING_CONFIGURATION: resnet50-tf-fp32-predictor-default
SERVING_REVISION: resnet50-tf-fp32-predictor-default-5x7qv
QUEUE_SERVING_PORT: 8012
CONTAINER_CONCURRENCY: 0
REVISION_TIMEOUT_SECONDS: 60
SERVING_POD: resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx (v1:metadata.name)
SERVING_POD_IP: (v1:status.podIP)
SERVING_LOGGING_CONFIG: {
"level": "info",
"development": false,
"outputPaths": ["stdout"],
"errorOutputPaths": ["stderr"],
"encoding": "json",
"encoderConfig": {
"timeKey": "ts",
"levelKey": "level",
"nameKey": "logger",
"callerKey": "caller",
"messageKey": "msg",
"stacktraceKey": "stacktrace",
"lineEnding": "",
"levelEncoder": "",
"timeEncoder": "iso8601",
"durationEncoder": "",
"callerEncoder": ""
}
}
SERVING_LOGGING_LEVEL:
SERVING_REQUEST_LOG_TEMPLATE:
SERVING_REQUEST_METRICS_BACKEND: prometheus
TRACING_CONFIG_BACKEND: none
TRACING_CONFIG_ZIPKIN_ENDPOINT:
TRACING_CONFIG_STACKDRIVER_PROJECT_ID:
TRACING_CONFIG_DEBUG: false
TRACING_CONFIG_SAMPLE_RATE: 0.100000
USER_PORT: 8080
SYSTEM_NAMESPACE: knative-serving
METRICS_DOMAIN: knative.dev/internal/serving
USER_CONTAINER_NAME: kfserving-container
ENABLE_VAR_LOG_COLLECTION: false
VAR_LOG_VOLUME_NAME: knative-var-log
INTERNAL_VOLUME_PATH: /var/knative-internal
SERVING_READINESS_PROBE: {"tcpSocket":{"port":8080,"host":"127.0.0.1"},"successThreshold":1}
ENABLE_PROFILING: false
SERVING_ENABLE_PROBE_REQUEST_LOG: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xpccj (ro)
istio-proxy:
Container ID:
Image: docker.io/istio/proxyv2:1.1.6
Image ID:
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--configPath
/etc/istio/proxy
--binaryPath
/usr/local/bin/envoy
--serviceCluster
resnet50-tf-fp32-predictor-default-5x7qv.$(POD_NAMESPACE)
--drainDuration
45s
--parentShutdownDuration
1m0s
--discoveryAddress
istio-pilot.istio-system:15010
--zipkinAddress
zipkin.istio-system:9411
--connectTimeout
10s
--proxyAdminPort
15000
--concurrency
2
--controlPlaneAuthPolicy
NONE
--statusPort
15020
--applicationPorts
8080,8022,9090,9091,8012
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 128Mi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15020/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
Environment:
POD_NAME: resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx (v1:metadata.name)
POD_NAMESPACE: kubeflow-tomasz-sadowski (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
ISTIO_META_POD_NAME: resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx (v1:metadata.name)
ISTIO_META_CONFIG_NAMESPACE: kubeflow-tomasz-sadowski (v1:metadata.namespace)
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_METAJSON_ANNOTATIONS: {"autoscaling.knative.dev/class":"kpa.autoscaling.knative.dev","autoscaling.knative.dev/target":"1","internal.serving.kubeflow.org/storage-initializer-sourceuri":"gs://ai-inferencing/models/resnet50-tf-fp32","queue.sidecar.serving.knative.dev/resourcePercentage":"0.2","serving.knative.dev/creator":"system:serviceaccount:kubeflow:default","traffic.sidecar.istio.io/includeOutboundIPRanges":"*"}
ISTIO_METAJSON_LABELS: {"app":"resnet50-tf-fp32-predictor-default-5x7qv","pod-template-hash":"66cd8b97cc","serving.knative.dev/configuration":"resnet50-tf-fp32-predictor-default","serving.knative.dev/configurationGeneration":"1","serving.knative.dev/revision":"resnet50-tf-fp32-predictor-default-5x7qv","serving.knative.dev/revisionUID":"9af2da45-9149-11ea-923d-42010a800034","serving.knative.dev/service":"resnet50-tf-fp32-predictor-default","serving.kubeflow.org/inferenceservice":"resnet50-tf-fp32"}
Mounts:
/etc/certs/ from istio-certs (ro)
/etc/istio/proxy from istio-envoy (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xpccj (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
knative-var-log:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
default-token-xpccj:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xpccj
Optional: false
kfserving-provision-location:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-certs:
Type: Secret (a volume populated by a Secret)
SecretName: istio.default
Optional: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 83s default-scheduler Successfully assigned kubeflow-tomasz-sadowski/resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx to gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg
Normal Pulled 35s (x4 over 82s) kubelet, gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg Container image "gcr.io/kfserving/storage-initializer:0.2.2" already present on machine
Normal Created 35s (x4 over 82s) kubelet, gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg Created container storage-initializer
Normal Started 35s (x4 over 82s) kubelet, gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg Started container storage-initializer
Warning BackOff 7s (x6 over 75s) kubelet, gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg Back-off restarting failed container
I observed that this issue occurs when the directory structure passed to storage-initializer looks like:

- dir
  - subdir
    - model-file

Unfortunately, exactly such a Google Cloud Storage directory structure is required by TensorFlow Serving:

- model
  - model version
    - model file

I tried various settings in my Google Cloud Storage bucket (object viewer, legacy object viewer, IAM, ACL, etc.), but I could not stop models/resnet50-tf-fp32/1 from being listed as an object. I also could not find any documentation on how to tune access rights for a bucket holding a model, which is another reason why I am filing this bug.
My understanding is that KFServing should be able to read a model from a user's Google Cloud Storage bucket with the directory structure described above.
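For clarity, this is the on-disk layout TensorFlow Serving expects under --model_base_path (here /mnt/models, matching the container args in the pod description above; the variables files are illustrative, taken from the flowers sample):

/mnt/models/
    1/
        saved_model.pb
        variables/
            variables.data-00000-of-00001
            variables.index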
- Storage-initializer docker image version: gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
- Istio Version:
- Knative Version:
- KFServing Version: 0.11.1
- Kubeflow version: 1.0.0 (GCP)
- Minikube version: GCP
- Kubernetes version (use kubectl version): v1.14.10-gke.37
- OS (e.g. from /etc/os-release):
Top GitHub Comments
I looked into this issue, and it occurs when I create a directory structure using 'Create folder' in the Google Cloud Storage UI and then upload a model into that pre-created directory structure. Google Cloud Storage does not have a real concept of folders, and the UI creates an object with the content 'placeholder', probably as a workaround.
When I upload a model using
gsutil cp <directory> gs://<bucket>/<some directory>
it works fine. However, it would be great to handle model downloading from Google Cloud Storage when the directory structure and models were uploaded through the GCP UI.
Many thanks.
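For reference, a quick way to check whether a bucket contains such placeholder objects (a sketch using the google-cloud-storage client; the bucket name and prefix are illustrative):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("<MY-BUCKET>")  # illustrative bucket name
for blob in bucket.list_blobs(prefix="models/resnet50-tf-fp32/"):
    # "Folders" created in the UI appear as objects whose names end in "/".
    if blob.name.endswith("/"):
        print("placeholder object:", blob.name)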
@ellistarn With the patch below I was able to download from GCS and recreate locally the nested model structure produced and required by TensorFlow. How do I go about patching my Kubeflow 1.3 deployment with this?
python/kfserving/kfserving/storage.py:

import pathlib

from google.auth import exceptions
from google.cloud import storage

_GCS_PREFIX = "gs://"


# Method of the Storage class in storage.py.
@staticmethod
def _download_gcs(uri, temp_dir: str):
    try:
        storage_client = storage.Client()
    except exceptions.DefaultCredentialsError:
        storage_client = storage.Client.create_anonymous_client()
    bucket_args = uri.replace(_GCS_PREFIX, "", 1).split("/", 1)
    bucket_name = bucket_args[0]
    bucket_path = bucket_args[1] if len(bucket_args) > 1 else ""
    bucket = storage_client.bucket(bucket_name)
    prefix = bucket_path
    if not prefix.endswith("/"):
        prefix = prefix + "/"
    blobs = bucket.list_blobs(prefix=prefix)
    count = 0
    for blob in blobs:
        # Skip "folder" placeholder objects created through the GCS UI.
        if blob.name.endswith("/"):
            continue
        # Recreate the blob's directory structure under temp_dir.
        localdir = blob.name.split('/')[0:-1]
        localdir = pathlib.Path(temp_dir, '/'.join(localdir))
        localfile = pathlib.Path(temp_dir, blob.name)
        localdir.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(localfile)
        count = count + 1
    if count == 0:
        raise RuntimeError("Failed to fetch model. "
                           "The path or model %s does not exist." % uri)
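Relative to the released storage.py, the important changes are the blob.name.endswith("/") guard, which skips the placeholder objects described above, and localdir.mkdir(parents=True, exist_ok=True), which recreates the nested local directories before each download.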