
Issue with model downloading from Google Storage when model is in subdirectory


/kind bug

I deployed Kubeflow on GCP and uploaded my model to a Google Cloud Storage bucket in the same project.

I prepared a sample deployment based on https://github.com/kubeflow/kfserving/blob/master/docs/samples/tensorflow/tensorflow.yaml. When I use storageUri: "gs://kfserving-samples/models/tensorflow/flowers" and change the port, the deployment works correctly.

However, when I point the deployment at my own Google Storage bucket, the kfserving/storage-initializer Docker container fails with the error:

FileExistsError: [Errno 17] File exists: './1'

The model cannot be read and the whole deployment ends up with the RevisionMissing error.

Google Storage model path: gs://<MY-BUCKET>/models/resnet50-tf-fp32/1/saved_model.pb

Below is my deployment.yaml:

apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "resnet50-tf-fp32"
spec:
  default:
    predictor:
      tensorflow:
        storageUri: "gs://<MY-BUCKET>/models/resnet50-tf-fp32"
        ports:
          - name: h2c
            containerPort: 8500

I debugged the kfserving/storage-initializer Docker container: bucket.list_blobs from https://github.com/kubeflow/kfserving/blob/master/python/kfserving/kfserving/storage.py#L98 returns the following for my bucket:

<Blob: <MY-BUCKET>, models/resnet50-tf-fp32/>, 
<Blob: <MY-BUCKET>, models/resnet50-tf-fp32/1/>, 
<Blob: <MY-BUCKET>, models/resnet50-tf-fp32/1/saved_model.pb>

While processing models/resnet50-tf-fp32/1/, this line: https://github.com/kubeflow/kfserving/blob/master/python/kfserving/kfserving/storage.py#L112 creates a local file named 1 with the content placeholder, which is the culprit.

Then, while processing models/resnet50-tf-fp32/1/saved_model.pb, the application tries to create a directory named 1 and fails because a file with the same name was created in the previous step. Exact line: https://github.com/kubeflow/kfserving/blob/master/python/kfserving/kfserving/storage.py#L108
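For illustration, here is a rough, self-contained approximation of the storage.py logic referenced above (my own simplification: a hard-coded blob list stands in for bucket.list_blobs and a local write stands in for blob.download_to_filename). Running it reproduces the same FileExistsError:

import os
import tempfile

# Blob names as returned for my bucket, including the two "folder" objects
blob_names = [
    "models/resnet50-tf-fp32/",
    "models/resnet50-tf-fp32/1/",
    "models/resnet50-tf-fp32/1/saved_model.pb",
]
bucket_path = "models/resnet50-tf-fp32"
temp_dir = tempfile.mkdtemp()

for name in blob_names:
    subdir_object_key = name.replace(bucket_path, "", 1).strip("/")
    if "/" in subdir_object_key:
        # For ".../1/saved_model.pb" this tries to create directory "1", but a
        # plain file named "1" already exists from the previous iteration:
        # FileExistsError: [Errno 17] File exists: '.../1'
        local_object_dir = os.path.join(temp_dir, subdir_object_key.rsplit("/", 1)[0])
        if not os.path.isdir(local_object_dir):
            os.makedirs(local_object_dir, exist_ok=True)
    if subdir_object_key != "":
        # For ".../1/" the key collapses to plain "1", so the folder object is
        # written to a local *file* named "1"
        dest_path = os.path.join(temp_dir, subdir_object_key)
        with open(dest_path, "wb") as f:
            f.write(b"placeholder")  # stands in for blob.download_to_filename()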

I compared this with the working example for gs://kfserving-samples/models/tensorflow/flowers, where bucket.list_blobs returns:

<Blob: kfserving-samples, models/tensorflow/flowers/0001/saved_model.pb, 1557514652201269>, 
<Blob: kfserving-samples, models/tensorflow/flowers/0001/variables/variables.data-00000-of-00001, 1557514652575363>, 
<Blob: kfserving-samples, models/tensorflow/flowers/0001/variables/variables.index, 1557514652853907>

That one works because no object for the directory models/tensorflow/flowers/0001 is returned, so the storage-initializer never processes a folder placeholder.

kubectl get all
NAME                                                                  READY   STATUS       RESTARTS   AGE
pod/resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8bmh8rn   0/3     Init:Error   1          18s

NAME                                                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                             AGE
service/resnet50-tf-fp32-predictor-default-5x7qv           ClusterIP   10.11.244.141   <none>        80/TCP                              19s
service/resnet50-tf-fp32-predictor-default-5x7qv-private   ClusterIP   10.11.253.131   <none>        80/TCP,9090/TCP,9091/TCP,8022/TCP   19s

NAME                                                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/resnet50-tf-fp32-predictor-default-5x7qv-deployment   0/1     1            0           19s

NAME                                                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8b97cc   1         1         0       19s

NAME                                                           URL                                                                              READY     REASON
route.serving.knative.dev/resnet50-tf-fp32-predictor-default   http://resnet50-tf-fp32-predictor-default.kubeflow-tomasz-sadowski.example.com   Unknown   RevisionMissing

NAME                                                                    CONFIG NAME                          K8S SERVICE NAME                           GENERATION   READY     REASON
revision.serving.knative.dev/resnet50-tf-fp32-predictor-default-5x7qv   resnet50-tf-fp32-predictor-default   resnet50-tf-fp32-predictor-default-5x7qv   1            Unknown   Deploying

NAME                                                                   LATESTCREATED                              LATESTREADY   READY     REASON
configuration.serving.knative.dev/resnet50-tf-fp32-predictor-default   resnet50-tf-fp32-predictor-default-5x7qv                 Unknown   

NAME                                                             URL                                                                              LATESTCREATED                              LATESTREADY   READY     REASON
service.serving.knative.dev/resnet50-tf-fp32-predictor-default   http://resnet50-tf-fp32-predictor-default.kubeflow-tomasz-sadowski.example.com   resnet50-tf-fp32-predictor-default-5x7qv                 Unknown   RevisionMissing
kubectl describe pod/....
Name:           resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx
Namespace:      kubeflow-tomasz-sadowski
Priority:       0
Node:           gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg/10.128.15.219
Start Time:     Fri, 08 May 2020 18:36:50 +0200
Labels:         app=resnet50-tf-fp32-predictor-default-5x7qv
                pod-template-hash=66cd8b97cc
                serving.knative.dev/configuration=resnet50-tf-fp32-predictor-default
                serving.knative.dev/configurationGeneration=1
                serving.knative.dev/revision=resnet50-tf-fp32-predictor-default-5x7qv
                serving.knative.dev/revisionUID=9af2da45-9149-11ea-923d-42010a800034
                serving.knative.dev/service=resnet50-tf-fp32-predictor-default
                serving.kubeflow.org/inferenceservice=resnet50-tf-fp32
Annotations:    autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
                autoscaling.knative.dev/target: 1
                internal.serving.kubeflow.org/storage-initializer-sourceuri: gs://ai-inferencing/models/resnet50-tf-fp32
                queue.sidecar.serving.knative.dev/resourcePercentage: 0.2
                serving.knative.dev/creator: system:serviceaccount:kubeflow:default
                sidecar.istio.io/status:
                  {"version":"5f3ae3613c7945ef767cb9fd594596bc001ff3ab915f12e4379c0cb5648d2729","initContainers":["istio-init"],"containers":["istio-proxy"]...
                traffic.sidecar.istio.io/includeOutboundIPRanges: *
Status:         Pending
IP:             10.8.1.91
IPs:            <none>
Controlled By:  ReplicaSet/resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8b97cc
Init Containers:
  storage-initializer:
    Container ID:  docker://5e9e492a9e899b4d29ac128c2b5815353372f8892f710ddef18bed20b2ff9107
    Image:         gcr.io/kfserving/storage-initializer:0.2.2
    Image ID:      docker-pullable://gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
    Port:          <none>
    Host Port:     <none>
    Args:
      gs://ai-inferencing/models/resnet50-tf-fp32
      /mnt/models
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 08 May 2020 18:37:38 +0200
      Finished:     Fri, 08 May 2020 18:37:39 +0200
    Ready:          False
    Restart Count:  3
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /mnt/models from kfserving-provision-location (rw)
  istio-init:
    Container ID:  
    Image:         docker.io/istio/proxy_init:1.1.6
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      -p
      15001
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x
      
      -b
      8080,8022,9090,9091,8012
      -d
      15020
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:        10m
      memory:     10Mi
    Environment:  <none>
    Mounts:       <none>
Containers:
  kfserving-container:
    Container ID:  
    Image:         index.docker.io/tensorflow/serving@sha256:f7e59a29cbc17a6b507751cddde37bccad4407c05ebf2c13b8e6ccb7d2e9affb
    Image ID:      
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /usr/bin/tensorflow_model_server
    Args:
      --port=9000
      --rest_api_port=8080
      --model_name=resnet50-tf-fp32
      --model_base_path=/mnt/models
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      PORT:             8080
      K_REVISION:       resnet50-tf-fp32-predictor-default-5x7qv
      K_CONFIGURATION:  resnet50-tf-fp32-predictor-default
      K_SERVICE:        resnet50-tf-fp32-predictor-default
    Mounts:
      /mnt/models from kfserving-provision-location (ro)
      /var/log from knative-var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xpccj (ro)
  queue-proxy:
    Container ID:   
    Image:          gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:792f6945c7bc73a49a470a5b955c39c8bd174705743abf5fb71aa0f4c04128eb
    Image ID:       
    Ports:          8022/TCP, 9090/TCP, 9091/TCP, 8012/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     40m
      memory:  200Mi
    Requests:
      cpu:      25m
      memory:   50Mi
    Readiness:  exec [/ko-app/queue -probe-period 0] delay=0s timeout=10s period=1s #success=1 #failure=3
    Environment:
      SERVING_NAMESPACE:                      kubeflow-tomasz-sadowski
      SERVING_SERVICE:                        resnet50-tf-fp32-predictor-default
      SERVING_CONFIGURATION:                  resnet50-tf-fp32-predictor-default
      SERVING_REVISION:                       resnet50-tf-fp32-predictor-default-5x7qv
      QUEUE_SERVING_PORT:                     8012
      CONTAINER_CONCURRENCY:                  0
      REVISION_TIMEOUT_SECONDS:               60
      SERVING_POD:                            resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx (v1:metadata.name)
      SERVING_POD_IP:                          (v1:status.podIP)
      SERVING_LOGGING_CONFIG:                 {
                                                "level": "info",
                                                "development": false,
                                                "outputPaths": ["stdout"],
                                                "errorOutputPaths": ["stderr"],
                                                "encoding": "json",
                                                "encoderConfig": {
                                                  "timeKey": "ts",
                                                  "levelKey": "level",
                                                  "nameKey": "logger",
                                                  "callerKey": "caller",
                                                  "messageKey": "msg",
                                                  "stacktraceKey": "stacktrace",
                                                  "lineEnding": "",
                                                  "levelEncoder": "",
                                                  "timeEncoder": "iso8601",
                                                  "durationEncoder": "",
                                                  "callerEncoder": ""
                                                }
                                              }
      SERVING_LOGGING_LEVEL:                  
      SERVING_REQUEST_LOG_TEMPLATE:           
      SERVING_REQUEST_METRICS_BACKEND:        prometheus
      TRACING_CONFIG_BACKEND:                 none
      TRACING_CONFIG_ZIPKIN_ENDPOINT:         
      TRACING_CONFIG_STACKDRIVER_PROJECT_ID:  
      TRACING_CONFIG_DEBUG:                   false
      TRACING_CONFIG_SAMPLE_RATE:             0.100000
      USER_PORT:                              8080
      SYSTEM_NAMESPACE:                       knative-serving
      METRICS_DOMAIN:                         knative.dev/internal/serving
      USER_CONTAINER_NAME:                    kfserving-container
      ENABLE_VAR_LOG_COLLECTION:              false
      VAR_LOG_VOLUME_NAME:                    knative-var-log
      INTERNAL_VOLUME_PATH:                   /var/knative-internal
      SERVING_READINESS_PROBE:                {"tcpSocket":{"port":8080,"host":"127.0.0.1"},"successThreshold":1}
      ENABLE_PROFILING:                       false
      SERVING_ENABLE_PROBE_REQUEST_LOG:       false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xpccj (ro)
  istio-proxy:
    Container ID:  
    Image:         docker.io/istio/proxyv2:1.1.6
    Image ID:      
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --configPath
      /etc/istio/proxy
      --binaryPath
      /usr/local/bin/envoy
      --serviceCluster
      resnet50-tf-fp32-predictor-default-5x7qv.$(POD_NAMESPACE)
      --drainDuration
      45s
      --parentShutdownDuration
      1m0s
      --discoveryAddress
      istio-pilot.istio-system:15010
      --zipkinAddress
      zipkin.istio-system:9411
      --connectTimeout
      10s
      --proxyAdminPort
      15000
      --concurrency
      2
      --controlPlaneAuthPolicy
      NONE
      --statusPort
      15020
      --applicationPorts
      8080,8022,9090,9091,8012
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  128Mi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15020/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
    Environment:
      POD_NAME:                      resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow-tomasz-sadowski (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      ISTIO_META_POD_NAME:           resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx (v1:metadata.name)
      ISTIO_META_CONFIG_NAMESPACE:   kubeflow-tomasz-sadowski (v1:metadata.namespace)
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_METAJSON_ANNOTATIONS:    {"autoscaling.knative.dev/class":"kpa.autoscaling.knative.dev","autoscaling.knative.dev/target":"1","internal.serving.kubeflow.org/storage-initializer-sourceuri":"gs://ai-inferencing/models/resnet50-tf-fp32","queue.sidecar.serving.knative.dev/resourcePercentage":"0.2","serving.knative.dev/creator":"system:serviceaccount:kubeflow:default","traffic.sidecar.istio.io/includeOutboundIPRanges":"*"}
                                     
      ISTIO_METAJSON_LABELS:         {"app":"resnet50-tf-fp32-predictor-default-5x7qv","pod-template-hash":"66cd8b97cc","serving.knative.dev/configuration":"resnet50-tf-fp32-predictor-default","serving.knative.dev/configurationGeneration":"1","serving.knative.dev/revision":"resnet50-tf-fp32-predictor-default-5x7qv","serving.knative.dev/revisionUID":"9af2da45-9149-11ea-923d-42010a800034","serving.knative.dev/service":"resnet50-tf-fp32-predictor-default","serving.kubeflow.org/inferenceservice":"resnet50-tf-fp32"}
                                     
    Mounts:
      /etc/certs/ from istio-certs (ro)
      /etc/istio/proxy from istio-envoy (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xpccj (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  knative-var-log:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  default-token-xpccj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-xpccj
    Optional:    false
  kfserving-provision-location:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio.default
    Optional:    true
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                From                                                      Message
  ----     ------     ----               ----                                                      -------
  Normal   Scheduled  83s                default-scheduler                                         Successfully assigned kubeflow-tomasz-sadowski/resnet50-tf-fp32-predictor-default-5x7qv-deployment-66cd8brw7wx to gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg
  Normal   Pulled     35s (x4 over 82s)  kubelet, gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg  Container image "gcr.io/kfserving/storage-initializer:0.2.2" already present on machine
  Normal   Created    35s (x4 over 82s)  kubelet, gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg  Created container storage-initializer
  Normal   Started    35s (x4 over 82s)  kubelet, gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg  Started container storage-initializer
  Warning  BackOff    7s (x6 over 75s)   kubelet, gke-kubeflow-kubeflow-cpu-pool-v1-b948cd61-9blg  Back-off restarting failed container

I observed that the issue occurs whenever the following directory structure is passed to the storage-initializer:

- dir
  - subdir
    - model-file

Unfortunately, exactly this Google Storage directory structure is required by TensorFlow Serving:

- model
  - model version
    - model file

I tried various settings on my Google Storage bucket (object viewer, legacy object viewer, IAM, ACL, etc.) but could not stop the models/resnet50-tf-fp32/1/ object from being listed. I also could not find any documentation on tuning Google Storage access rights for a model bucket, which is another reason I am filing this as a bug.
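To confirm what is actually listed under the prefix, a quick check with the google-cloud-storage client can be used (an illustrative snippet of mine, not KFServing code; it assumes the google-cloud-storage package and working credentials, with <MY-BUCKET> as above):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("<MY-BUCKET>")
for blob in bucket.list_blobs(prefix="models/resnet50-tf-fp32/"):
    # Objects whose names end with "/" are the folder placeholders created by the UI
    marker = "  <- folder placeholder" if blob.name.endswith("/") else ""
    print(blob.name, blob.size, marker)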

My understanding is that KFServing should be able to read a user's Google Storage bucket containing a model in the directory structure described above.

  • storage initializer docker image version: gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
  • Istio Version:
  • Knative Version:
  • KFServing Version: 0.11.1
  • Kubeflow version: 1.0.0 (GCP)
  • Minikube version: GCP
  • Kubernetes version: (use kubectl version): v1.14.10-gke.37
  • OS (e.g. from /etc/os-release):

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 16

Top GitHub Comments

5 reactions
tomasz-sadowski commented, May 19, 2020

I looked into this issue: it occurs when I create the directory structure using 'Create folder' in the Google Storage UI and then upload the model into that previously created structure. Google Storage does not actually have a concept of folders, so the UI creates an object with the content 'placeholder', probably as a workaround.

When I upload the model with gsutil cp <directory> gs://<bucket>/<some directory> instead, it works fine.
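If re-uploading with gsutil is not convenient, one possible cleanup (my suggestion, not something verified in this thread; it assumes the google-cloud-storage package and credentials) is to delete the folder placeholder objects so the existing upload can be served:

from google.cloud import storage

def delete_folder_placeholders(bucket_name: str, prefix: str) -> None:
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    for blob in bucket.list_blobs(prefix=prefix.rstrip("/") + "/"):
        if blob.name.endswith("/"):  # placeholder created by the "Create folder" UI action
            print("deleting", blob.name)
            blob.delete()

# Example: delete_folder_placeholders("<MY-BUCKET>", "models/resnet50-tf-fp32")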

However, it would be great if downloading a model from Google Storage also worked when the directory structure and model were uploaded through the GCP UI.

Many thanks.

1 reaction
edi-bice-by commented, Jun 29, 2021

@ellistarn

With the change below I was able to download from GCS and recreate locally the nested model structure produced and required by TensorFlow. How do I go about patching my Kubeflow 1.3 deployment with this?

python/kfserving/kfserving/storage.py

@staticmethod
def _download_gcs(uri, temp_dir: str):
    # Relies on storage.py's module-level imports (pathlib, google.cloud storage,
    # google.auth exceptions) and the _GCS_PREFIX constant
    try:
        storage_client = storage.Client()
    except exceptions.DefaultCredentialsError:
        storage_client = storage.Client.create_anonymous_client()
    bucket_args = uri.replace(_GCS_PREFIX, "", 1).split("/", 1)
    bucket_name = bucket_args[0]
    bucket_path = bucket_args[1] if len(bucket_args) > 1 else ""
    bucket = storage_client.bucket(bucket_name)
    prefix = bucket_path
    if not prefix.endswith("/"):
        prefix = prefix + "/"
    blobs = bucket.list_blobs(prefix=prefix)
    count = 0
    for blob in blobs:
        # Skip the "folder" placeholder objects created by the GCS UI
        if blob.name.endswith("/"):
            continue
        # Recreate the blob's directory structure under temp_dir, then download
        localdir = blob.name.split('/')[0:-1]
        localdir = pathlib.Path(temp_dir, '/'.join(localdir))
        localfil = pathlib.Path(temp_dir, blob.name)
        localdir.mkdir(parents=True, exist_ok=True)
        blob.download_to_filename(localfil)
        count = count + 1
    if count == 0:
        raise RuntimeError(
            "Failed to fetch model. The path or model %s does not exist." % uri)
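A hypothetical local smoke test for the patched method (my sketch; it assumes GCS access and that Storage is the kfserving.storage.Storage class this staticmethod belongs to):

import tempfile
from kfserving.storage import Storage

target = tempfile.mkdtemp()
# Folder placeholder blobs (names ending with "/") are skipped and parent
# directories are created with mkdir(parents=True, exist_ok=True), so the
# nested model layout downloads cleanly.
Storage._download_gcs("gs://<MY-BUCKET>/models/resnet50-tf-fp32", target)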
