TorchServe returns 400 `DownloadArchiveException` when the process does not have write access to the model store folder
Hello. I recently installed an EKS cluster with TorchServe following the tutorial https://github.com/pytorch/serve/tree/master/kubernetes/EKS, but I am having trouble uploading a model.
When I try to upload a model via:
curl -X POST "http://$HOST:8081/models?url=http%3A//54.190.129.247%3A8222/model_ubuntu_2dd0aac04a22d6a0.mar"
curl -X POST "http://$HOST:8081/models?url=http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar"
I am getting the following error:
{
"code": 400,
"type": "DownloadArchiveException",
"message": "Failed to download archive from: http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar"
}
Although http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar is a valid URL.
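One quick way to rule out a network problem is to fetch the archive from inside the torchserve pod itself (the pod name is taken from the `kubectl describe` output below; this assumes `curl` is available in the image):

```shell
# Check that the .mar file is reachable from inside the torchserve container;
# -f makes curl fail on HTTP errors, -I fetches only the response headers.
kubectl exec -it torchserve-6d4d5c8c89-zmnp9 -- \
  curl -sSfI http://54.190.129.247:8222/model_ubuntu_2dd0aac04a22d6a0.mar
```

If this succeeds but registration still returns 400, the failure is local to the pod (for example, no write access to the model store), not a download problem.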
kubectl describe pod -n default torchserve-6d4d5c8c89-zmnp9:
Name: torchserve-6d4d5c8c89-zmnp9
Namespace: default
Priority: 0
Node: ip-192-168-57-45.us-west-2.compute.internal/192.168.57.45
Start Time: Thu, 26 Aug 2021 13:13:21 -0700
Labels: app=torchserve
pod-template-hash=6d4d5c8c89
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.38.125
IPs:
IP: 192.168.38.125
Controlled By: ReplicaSet/torchserve-6d4d5c8c89
Containers:
torchserve:
Container ID: docker://a64f5ef418c569249c1c05fe3056d808c2e22b79c203aed05017580bea132cc0
Image: pytorch/torchserve:latest
Image ID: docker-pullable://pytorch/torchserve@sha256:3c290c60cb89bca38fbf1d6a36ea99554b3dbb9d32cb89ed434828c5b3fd2c73
Ports: 8080/TCP, 8081/TCP, 8082/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
Args:
torchserve
--start
--model-store
/home/model-server/shared/model-store/
--ts-config
/home/model-server/shared/config/config.properties
State: Running
Started: Thu, 26 Aug 2021 13:13:22 -0700
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 4Gi
nvidia.com/gpu: 0
Requests:
cpu: 1
memory: 1Gi
nvidia.com/gpu: 0
Environment: <none>
Mounts:
/home/model-server/shared/ from persistent-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-z8vb9 (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
persistent-storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: model-store-claim
ReadOnly: false
default-token-z8vb9:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-z8vb9
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 17m default-scheduler Successfully assigned default/torchserve-6d4d5c8c89-zmnp9 to ip-192-168-57-45.us-west-2.compute.internal
Normal Pulled 17m kubelet Container image "pytorch/torchserve:latest" already present on machine
Normal Created 17m kubelet Created container torchserve
Normal Started 17m kubelet Started container torchserve
config.properties:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
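Note a mismatch: this file sets `model_store=/home/model-server/model-store`, while the container args above pass `--model-store /home/model-server/shared/model-store/`; the startup log below shows the CLI flag wins. If you rely on the config file, a fragment like this (path taken from the deployment args) keeps the two in sync:

```properties
# Keep in sync with the --model-store argument in the deployment spec
model_store=/home/model-server/shared/model-store
```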
Output of access.log:
2021-08-30 01:28:19,091 [INFO ] epollEventLoopGroup-3-12 ACCESS_LOG - /192.168.24.51:2472 "POST /models?url=http://34.219.222.97:8221/model_ubuntu_09888c953c68c1fa.mar%26model_name=aivanou HTTP/1.1" 400 6
2021-08-30 01:28:19,091 [INFO ] epollEventLoopGroup-3-12 TS_METRICS - Requests4XX.Count:1|#Level:Host|#hostname:torchserve-69494c8469-8f8z8,timestamp:null
2021-08-30 01:28:20,568 [INFO ] epollEventLoopGroup-3-13 ACCESS_LOG - /192.168.32.146:61380 "POST /models?url=http://34.219.222.97:8221/model_ubuntu_09888c953c68c1fa.mar&model_name=aivanou HTTP/1.1" 400 7
2021-08-30 01:28:20,568 [INFO ] epollEventLoopGroup-3-13 TS_METRICS - Requests4XX.Count:1|#Level:Host|#hostname:torchserve-69494c8469-8f8z8,timestamp:null
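The first logged request also shows a separate encoding pitfall: the `&` separator arrived percent-encoded as `%26`, so `model_name=aivanou` was treated as part of the download URL. A sketch of letting curl encode only the parameter values (`$HOST` is a placeholder; flag behavior per the curl manual):

```shell
# -G appends the --data-urlencode pairs to the URL as a query string,
# percent-encoding each *value* while keeping the '&' separators literal;
# -X POST restores the POST method that the management API expects.
curl -G -X POST "http://$HOST:8081/models" \
  --data-urlencode "url=http://34.219.222.97:8221/model_ubuntu_09888c953c68c1fa.mar" \
  --data-urlencode "model_name=aivanou"
```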
TorchServe (ts) log output:
2021-08-30 03:00:50,425 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2021-08-30 03:00:50,609 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.4.2
TS Home: /usr/local/lib/python3.6/dist-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 2
Max heap size: 2048 M
Python executable: /usr/bin/python3
Config file: /home/model-server/shared/config/config.properties
Inference address: http://0.0.0.0:8080
Management address: http://0.0.0.0:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/model-server/shared/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 32
Netty client threads: 0
Default workers per model: 2
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /home/model-server/shared/model-store
Model config: N/A
2021-08-30 03:00:50,618 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2021-08-30 03:00:50,660 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2021-08-30 03:00:50,740 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2021-08-30 03:00:50,740 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2021-08-30 03:00:50,742 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://0.0.0.0:8081
2021-08-30 03:00:50,742 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-08-30 03:00:50,743 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
2021-08-30 03:03:28,587 [DEBUG] epollEventLoopGroup-3-18 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model mnist
2021-08-30 03:03:28,588 [DEBUG] epollEventLoopGroup-3-18 org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model mnist
2021-08-30 03:03:28,588 [INFO ] epollEventLoopGroup-3-18 org.pytorch.serve.wlm.ModelManager - Model mnist loaded.
2021-08-30 03:06:44,068 [DEBUG] epollEventLoopGroup-3-13 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1 for model tiny_image_net_aivanou_8df333374e4d115f
2021-08-30 03:06:44,069 [DEBUG] epollEventLoopGroup-3-13 org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1 for model tiny_image_net_aivanou_8df333374e4d115f
2021-08-30 03:06:44,069 [INFO ] epollEventLoopGroup-3-13 org.pytorch.serve.wlm.ModelManager - Model tiny_image_net_aivanou_8df333374e4d115f loaded.
Top GitHub Comments
Hi @chauhang. Yes, this is the 0.4.2 version. I think the main problem is not TorchServe itself but the model-store pod used in the https://github.com/pytorch/serve/tree/master/kubernetes/EKS guide. The command
kubectl exec --tty pod/model-store-pod -- mkdir /pv/model-store/
creates the folder model-store with permissions that are not accessible by the torchserve pod.
On the TorchServe side, there is a separate issue: if TorchServe does not have write permission to a directory it needs to write to, the service should throw a 5xx exception instead of a 4xx exception.
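A workaround consistent with this diagnosis is to create the folder with permissions the torchserve container user can write to (pod and path names follow the EKS guide; `chmod 777` is the bluntest option, so prefer a chown/chgrp matching the container's user for real deployments):

```shell
# Create the model store on the persistent volume, then open up its
# permissions so the non-root torchserve process can write .mar files into it.
kubectl exec --tty pod/model-store-pod -- mkdir -p /pv/model-store/
kubectl exec --tty pod/model-store-pod -- chmod 777 /pv/model-store/
```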
I am trying to install TorchServe on a GKE cluster, but when I execute the command
helm install ts .
I get the following error:
Error: INSTALLATION FAILED: template: torchserve/templates/torchserve.yaml:51:26: executing "torchserve/templates/torchserve.yaml" at <.Values.securityContext.groupId>: nil pointer evaluating interface {}.groupId
I found out that there is an open issue #1337; I came here after tracking the pull requests on the file Helm/templates/torchserve.yaml. All I can say is that after fixing the volume ownership group problem, something needs to be added to Helm/values.yaml about securityContext.groupId, so please check out open issue #1337 for more details.
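The nil-pointer error means the chart template dereferences `.Values.securityContext.groupId` while `Helm/values.yaml` defines no such block. A sketch of the missing fragment (the group id 1000 is illustrative only; see issue #1337 for the actual fix):

```yaml
# Helm/values.yaml -- field names must match what
# templates/torchserve.yaml dereferences (.Values.securityContext.groupId)
securityContext:
  groupId: 1000
```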