
Torchserve throws 400 `DownloadArchiveException` when the process does not have write access to model store folder

See original GitHub issue

Hello. I recently set up an EKS cluster with TorchServe following the tutorial https://github.com/pytorch/serve/tree/master/kubernetes/EKS, but I am having trouble uploading a model.

When I try to upload a model via:

curl -X POST "http://$HOST:8081/models?url=http%3A//54.190.129.247%3A8222/model_ubuntu_2dd0aac04a22d6a0.mar"

curl -X POST "http://$HOST:8081/models?url=http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar"

I am getting the following error:

    {
      "code": 400,
      "type": "DownloadArchiveException",
      "message": "Failed to download archive from: http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar"
    }

Although http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar is a valid URL.
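One thing worth checking when the URL itself is valid is the query-string encoding. Only the value of the `url` parameter should be percent-encoded; the `&` that separates parameters must stay literal (in the access.log further down, one request encodes it as `%26`, which folds `model_name` into the `url` value and makes the download fail). A small sketch of producing the encoded value, assuming a POSIX shell with python3 available:

```shell
# Percent-encode only the value of the "url" query parameter.
MAR_URL="http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar"
ENCODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$MAR_URL")
echo "$ENCODED"
# -> http%3A%2F%2F54.190.129.247%3A8222%2Fmodel_ubuntu_2dd0aac04a22d6a0.mar

# The separating "&" stays unencoded (model_name "my_model" is a placeholder):
# curl -X POST "http://$HOST:8081/models?url=$ENCODED&model_name=my_model"
```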

kubectl describe pod -n default torchserve-6d4d5c8c89-zmnp9:


Name:         torchserve-6d4d5c8c89-zmnp9
Namespace:    default
Priority:     0
Node:         ip-192-168-57-45.us-west-2.compute.internal/192.168.57.45
Start Time:   Thu, 26 Aug 2021 13:13:21 -0700
Labels:       app=torchserve
              pod-template-hash=6d4d5c8c89
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.38.125
IPs:
  IP:           192.168.38.125
Controlled By:  ReplicaSet/torchserve-6d4d5c8c89
Containers:
  torchserve:
    Container ID:  docker://a64f5ef418c569249c1c05fe3056d808c2e22b79c203aed05017580bea132cc0
    Image:         pytorch/torchserve:latest
    Image ID:      docker-pullable://pytorch/torchserve@sha256:3c290c60cb89bca38fbf1d6a36ea99554b3dbb9d32cb89ed434828c5b3fd2c73
    Ports:         8080/TCP, 8081/TCP, 8082/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      torchserve
      --start
      --model-store
      /home/model-server/shared/model-store/
      --ts-config
      /home/model-server/shared/config/config.properties
    State:          Running
      Started:      Thu, 26 Aug 2021 13:13:22 -0700
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          4Gi
      nvidia.com/gpu:  0
    Requests:
      cpu:             1
      memory:          1Gi
      nvidia.com/gpu:  0
    Environment:       <none>
    Mounts:
      /home/model-server/shared/ from persistent-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-z8vb9 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  model-store-claim
    ReadOnly:   false
  default-token-z8vb9:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-z8vb9
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  17m   default-scheduler  Successfully assigned default/torchserve-6d4d5c8c89-zmnp9 to ip-192-168-57-45.us-west-2.compute.internal
  Normal  Pulled     17m   kubelet            Container image "pytorch/torchserve:latest" already present on machine
  Normal  Created    17m   kubelet            Created container torchserve
  Normal  Started    17m   kubelet            Started container torchserve

config.properties:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store

Output of access.log:

2021-08-30 01:28:19,091 [INFO ] epollEventLoopGroup-3-12 ACCESS_LOG - /192.168.24.51:2472 "POST /models?url=http://34.219.222.97:8221/model_ubuntu_09888c953c68c1fa.mar%26model_name=aivanou HTTP/1.1" 400 6
2021-08-30 01:28:19,091 [INFO ] epollEventLoopGroup-3-12 TS_METRICS - Requests4XX.Count:1|#Level:Host|#hostname:torchserve-69494c8469-8f8z8,timestamp:null
2021-08-30 01:28:20,568 [INFO ] epollEventLoopGroup-3-13 ACCESS_LOG - /192.168.32.146:61380 "POST /models?url=http://34.219.222.97:8221/model_ubuntu_09888c953c68c1fa.mar&model_name=aivanou HTTP/1.1" 400 7
2021-08-30 01:28:20,568 [INFO ] epollEventLoopGroup-3-13 TS_METRICS - Requests4XX.Count:1|#Level:Host|#hostname:torchserve-69494c8469-8f8z8,timestamp:null

TorchServe log output:

2021-08-30 03:00:50,425 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2021-08-30 03:00:50,609 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.4.2
TS Home: /usr/local/lib/python3.6/dist-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 2
Max heap size: 2048 M
Python executable: /usr/bin/python3
Config file: /home/model-server/shared/config/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/model-server/shared/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/model-server/shared/model-store
Model config: N/A
2021-08-30 03:00:50,618 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2021-08-30 03:00:50,660 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2021-08-30 03:00:50,740 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2021-08-30 03:00:50,740 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2021-08-30 03:00:50,742 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2021-08-30 03:00:50,742 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-08-30 03:00:50,743 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
2021-08-30 03:03:28,587 [DEBUG] epollEventLoopGroup-3-18 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model mnist
2021-08-30 03:03:28,588 [DEBUG] epollEventLoopGroup-3-18 org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model mnist
2021-08-30 03:03:28,588 [INFO ] epollEventLoopGroup-3-18 org.pytorch.serve.wlm.ModelManager - Model mnist loaded.
2021-08-30 03:06:44,068 [DEBUG] epollEventLoopGroup-3-13 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1 for model tiny_image_net_aivanou_8df333374e4d115f
2021-08-30 03:06:44,069 [DEBUG] epollEventLoopGroup-3-13 org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1 for model tiny_image_net_aivanou_8df333374e4d115f
2021-08-30 03:06:44,069 [INFO ] epollEventLoopGroup-3-13 org.pytorch.serve.wlm.ModelManager - Model tiny_image_net_aivanou_8df333374e4d115f loaded.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 7

Top GitHub Comments

1 reaction
aivanou commented, Aug 30, 2021

Hi @chauhang. Yes, this is version 0.4.2. I think the main problem is not TorchServe itself but the model-store pod used in the https://github.com/pytorch/serve/tree/master/kubernetes/EKS guide. The command kubectl exec --tty pod/model-store-pod -- mkdir /pv/model-store/ creates the model-store folder with permissions that make it inaccessible to the torchserve pod.

On the TorchServe side, there is a separate issue: when TorchServe lacks write permission to the directory it needs to write to, the service should throw a 5xx error instead of a 4xx one.
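The permissions failure described above can be sketched locally. This is an illustration under assumptions, not a verified fix: it assumes the guide's mkdir runs as root and leaves the directory at the default mode 755, while the TorchServe container runs as an unprivileged user; the kubectl commands in the comments (and the uid/gid 1000) are hypothetical:

```shell
# A directory created by root with the default 755 mode is writable only by
# its owner; a non-root process (such as the model-server user inside the
# TorchServe container) cannot write .mar files into it.
mkdir -p /tmp/pv-demo/model-store
chmod 755 /tmp/pv-demo/model-store
stat -c '%a' /tmp/pv-demo/model-store
# -> 755: owner rwx, everyone else read/execute only

# A possible fix on the real cluster (uid/gid 1000 is an assumption):
#   kubectl exec --tty pod/model-store-pod -- chown -R 1000:1000 /pv/model-store/
# or, less restrictively:
#   kubectl exec --tty pod/model-store-pod -- chmod 777 /pv/model-store/
rm -rf /tmp/pv-demo
```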

0 reactions
hamzaraouzi commented, Jun 11, 2022

I am trying to install TorchServe on a GKE cluster, but when I execute the command helm install ts . I get the following error:

    Error: INSTALLATION FAILED: template: torchserve/templates/torchserve.yaml:51:26: executing "torchserve/templates/torchserve.yaml" at <.Values.securityContext.groupId>: nil pointer evaluating interface {}.groupId

I found that there is an open issue #1337. I got here after tracking the pull requests on the file Helm/templates/torchserve.yaml; the error appeared after the fix for volume ownership, which added this init container:

    initContainers:
      - name: volume-ownership
        image: alpine:3
        command:
          - chown
          - root:{{ .Values.securityContext.groupId }}
          - {{ .Values.torchserve.pvd_mount }}

Something needs to be added to Helm/values.yaml to define securityContext.groupId.

So please check out the open issue #1337 for more details about the problem.
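A sketch of what the missing values might look like, assuming the chart reads .Values.securityContext.groupId and .Values.torchserve.pvd_mount as the template error and the init container above suggest; the key names are inferred from that error and the gid value is a placeholder, not taken from the chart:

```yaml
# Hypothetical Helm/values.yaml fragment; gid 1000 is a placeholder.
securityContext:
  groupId: 1000

torchserve:
  # Mount path matching the pod's shared volume shown in the describe output.
  pvd_mount: /home/model-server/shared/
```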
