Stub process is unhealthy and it will be restarted
I’m getting "Stub process is unhealthy and it will be restarted" repeatedly when calling infer, after which the server restarts. I have deployed Triton server on GKE with 3 models. The first time I infer model1 I get this error; the second and subsequent requests to it don’t. But if I infer model2 after getting a successful result from model1, the error pops up again, and the same happens for model3.
logs:

    responses.append(self.triton_client.infer(
  File "/home/swapnesh/triton/triton_env/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1086, in infer
    raise_error_grpc(rpc_error)
  File "/home/swapnesh/triton/triton_env/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 61, in raise_error_grpc
    raise get_error_grpc(rpc_error) from None
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Failed to process the request(s) for model instance 'damage_0', message: Stub process is not healthy.
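For context, the client side is just the standard tritonclient gRPC flow; a minimal sketch of what the failing calls look like (the host, model names, input/output names, and shapes here are placeholders, not the exact ones from my repository):

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="<triton host>:8001")

# Placeholder input: the real models take preprocessed image tensors.
image = np.zeros((1, 3, 800, 800), dtype=np.float32)

inputs = [grpcclient.InferInput("INPUT__0", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [grpcclient.InferRequestedOutput("OUTPUT__0")]

responses = []
for model_name in ["model1", "model2", "model3"]:
    # The first request to each model is the one that fails with
    # "Stub process is not healthy"; retries on the same model succeed.
    responses.append(client.infer(model_name, inputs, outputs=outputs))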
I’m loading 3 models (converted Detectron models) using the Python backend and a custom Triton image, which I’ve built with this Dockerfile:
FROM nvcr.io/nvidia/tritonserver:21.10-py3
RUN pip3 install torch==1.9.1 torchvision==0.10.1 torchaudio==0.9.1 && \
    pip3 install pillow
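Each model follows the usual Python-backend layout (a model directory with config.pbtxt and model.py). A stripped-down sketch of the model.py structure — the tensor names and the dummy output are placeholders; the real execute() runs the converted Detectron model:

import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the config.pbtxt content as a JSON string.
        # The real models load the converted Detectron weights here.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            image = in_tensor.as_numpy()

            # Placeholder result; the real model runs inference on `image`.
            boxes = np.zeros((1, 4), dtype=np.float32)

            out_tensor = pb_utils.Tensor("OUTPUT__0", boxes)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses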
Also, while running Triton server locally with Docker, I had to increase the shm-size because it reported an error asking to raise it from the default 64MB.
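For reference, raising it locally looks roughly like this (the 1g value and the paths are placeholders; size the shm to your models):

docker run --gpus all --rm --shm-size=1g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  <custom triton image> \
  tritonserver --model-repository=/models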
On Kubernetes this is a little trickier, since you have to use an emptyDir volume with the Memory medium. My YAML looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: triton-mms
  name: triton-mms
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-mms
  template:
    metadata:
      labels:
        app: triton-mms
    spec:
      containers:
        - image: <custom triton image>
          command: ["/bin/sh", "-c"]
          args: ["tritonserver --model-repository=<gcs model repo>"]
          imagePullPolicy: IfNotPresent
          name: triton-mms
          ports:
            - containerPort: 8000
              name: http-triton
            - containerPort: 8001
              name: grpc-triton
            - containerPort: 8002
              name: metrics-triton
          env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /secret/gcp-creds.json
          resources:
            limits:
              memory: 5Gi
              nvidia.com/gpu: 1
            requests:
              memory: 5Gi
              nvidia.com/gpu: 1
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - name: vsecret
              mountPath: "/secret"
              readOnly: true
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "1024Mi"
        - name: vsecret
          secret:
            secretName: gcpcreds
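One way to see what /dev/shm actually looks like inside the running pod (the pod name is a placeholder, and the reported tmpfs size depends on the cluster rather than directly reflecting sizeLimit):

kubectl exec -it <triton pod name> -- df -h /dev/shm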
I’ve never faced this issue before, and I’m thinking it might be related to shared memory, since I’d never seen that shm error before either.
Top GitHub Comments
@Tabrizian we managed to get the model working in TorchScript (the PyTorch backend) and no longer experience this issue.
Just enlarging the memory of the container in Kubernetes will solve it.
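Concretely, in the Deployment above that means raising the container memory request/limit and, if the shm error persists, the dshm sizeLimit as well. The values below are only illustrative; size them to your models:

resources:
  limits:
    memory: 16Gi
    nvidia.com/gpu: 1
  requests:
    memory: 16Gi
    nvidia.com/gpu: 1
volumes:
  - name: dshm
    emptyDir:
      medium: Memory
      sizeLimit: "4Gi"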