
Stub process is unhealthy and it will be restarted

See original GitHub issue

I’m repeatedly getting Stub process is unhealthy and it will be restarted when calling infer, after which the server restarts. I have deployed Triton server on GKE with 3 models.

The first time I infer model1 I get this error; the second and subsequent requests don’t. But if I then infer model2 after getting a successful result from model1, the error pops up again, and likewise for model3.

logs:

responses.append(self.triton_client.infer(
      File "/home/swapnesh/triton/triton_env/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1086, in infer
        raise_error_grpc(rpc_error)
      File "/home/swapnesh/triton/triton_env/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 61, in raise_error_grpc
        raise get_error_grpc(rpc_error) from None
    tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] Failed to process the request(s) for model instance 'damage_0', message: Stub process is not healthy.

I’m loading 3 models using the python backend, in a custom Triton image (converted detectron models) which I’ve built from this Dockerfile:

FROM nvcr.io/nvidia/tritonserver:21.10-py3

RUN pip3 install torch==1.9.1 torchvision==0.10.1 torchaudio==0.9.1 && \
    pip3 install pillow

Also, while running Triton server locally with Docker, I had to increase shm-size because I was getting an error telling me to raise it above 64MB. On Kubernetes it’s a little trickier: you have to use an emptyDir volume with the Memory medium. My yaml looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: triton-mms
  name: triton-mms
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-mms
  template:
    metadata:
      labels:
        app: triton-mms
    spec:
      containers:
      - image: <custom triton image>
        command: ["/bin/sh", "-c"]
        args: ["tritonserver --model-repository=<gcs model repo>"]
        imagePullPolicy: IfNotPresent
        name: triton-mms
        ports:
        - containerPort: 8000
          name: http-triton
        - containerPort: 8001
          name: grpc-triton
        - containerPort: 8002
          name: metrics-triton
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /secret/gcp-creds.json
        resources:
          limits:
            memory: 5Gi
            nvidia.com/gpu: 1
          requests:
            memory: 5Gi
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
        - name: vsecret
          mountPath: "/secret"
          readOnly: true
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "1024Mi"
      - name: vsecret
        secret:
          secretName: gcpcreds
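For comparison, the local Docker fix mentioned above is a single flag. A sketch of the equivalent docker run invocation (the image name and model path are placeholders, not values from the original issue):

```shell
# --shm-size raises /dev/shm above Docker's 64MB default,
# which the python backend uses for stub <-> server IPC.
docker run --rm --gpus=all --shm-size=1g \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  my-custom-triton-image \
  tritonserver --model-repository=/models
```

The emptyDir volume with medium: Memory mounted at /dev/shm in the YAML above plays the same role as --shm-size does locally.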

I’ve never faced this issue before, and I’m thinking it might be related to shared memory, since I’ve never seen that error before either.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 35 (12 by maintainers)

Top GitHub Comments

3 reactions
s-rog commented, Feb 25, 2022

@Tabrizian we managed to get the model working in torchscript (torch backend) and no longer experience this issue
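The workaround s-rog describes is to export the model to TorchScript and serve it with Triton’s pytorch backend instead of the python backend (which avoids the stub process entirely). A minimal sketch of such an export, using a toy module in place of the actual detectron model:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model; the original issue's detectron
# models would need their own (often non-trivial) tracing/scripting.
class TinyModel(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2

model = TinyModel().eval()
example_input = torch.ones(1, 3)

# Trace the model into TorchScript with a representative input.
traced = torch.jit.trace(model, example_input)

# Triton's pytorch backend expects the file to be named model.pt
# inside the model version directory.
traced.save("model.pt")

# Sanity check: the reloaded TorchScript module behaves the same.
reloaded = torch.jit.load("model.pt")
print(reloaded(torch.ones(1, 3)))
```

Whether a given detectron model traces cleanly depends on its control flow; torch.jit.script may be needed where tracing fails.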

1 reaction
zhyj3038 commented, Feb 24, 2022

Just enlarging the container’s memory in Kubernetes will solve it.
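Applied to the Deployment manifest above, that advice amounts to raising both the container memory limit and the /dev/shm sizeLimit. A sketch of the changed fields (the 10Gi/2Gi values are illustrative, not from the original issue):

```yaml
        resources:
          limits:
            memory: 10Gi          # raised from 5Gi
            nvidia.com/gpu: 1
          requests:
            memory: 10Gi
            nvidia.com/gpu: 1
      # ...
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory          # memory-backed emptyDir counts
          sizeLimit: "2Gi"        # against the container memory limit
```

Note that a medium: Memory emptyDir is backed by tmpfs and its usage counts toward the pod’s memory limit, so the two values should be raised together.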

