Worker thread stuck in die state
Describe the bug
We encountered a problem similar to https://github.com/pytorch/serve/issues/1531, and it happens quite often.
See logs below. We have two workers (9000, 9001) for this model. After worker 9000 hit an exception, it got stuck in an unknown state: it didn't terminate itself, so no replacement worker could be added automatically, but in the meantime it also stopped receiving incoming traffic, which essentially means we only have one working worker (9001) now.
The problem is that this worker is stuck: it does not destruct itself (successfully) and it can't receive any traffic, yet it is still counted as an active worker, so TorchServe won't add more workers (because the current worker count is already 2). Normally the worker would die and TorchServe would retry it (e.g. you would find Retry worker: 9001 in 1 seconds. in the log).
If I curl the management API, it still shows that both workers are healthy.
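For reference, this is the kind of describe-model call meant here, assuming the default management port 8081 and the easyocr model name from the logs; in the stuck state it still lists both workers as healthy.

```bash
# Describe the model via the TorchServe management API (default port 8081).
# In the stuck state this still lists both workers, including 9000,
# even though 9000 no longer serves any traffic.
curl http://localhost:8081/models/easyocr
```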
Error logs
worker-9000 died because of an exception and did not log anything after 2022-08-25 21:21:14.056 PDT. Selected logs:
[WARN ] W-9000-easyocr_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
	at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
	at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
	at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
[INFO ] W-9000-easyocr_1.0-stdout MODEL_LOG - Frontend disconnected.
[INFO ] W-9000-easyocr_1.0 ACCESS_LOG - /127.0.0.1:40592 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 109
[INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
[INFO ] W-9000-easyocr_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-easyocr_1.0-stderr
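The CANCELLED: call already cancelled status comes from grpc-java: the frontend tries to write a response on a server call that the client has already cancelled (for example after a deadline), and ServerCallStreamObserverImpl.onNext throws. As a rough illustration only (not TorchServe's actual code or fix), a guard of roughly this shape avoids killing the worker thread on that send path; SafeGrpcReply and sendIfNotCancelled are made-up names.

```java
import io.grpc.StatusRuntimeException;
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;

/** Hypothetical helper; SafeGrpcReply is not a TorchServe class. */
public final class SafeGrpcReply {

    /**
     * Sends a reply only if the client has not already cancelled the call.
     * Without a guard like this, onNext() throws
     * "CANCELLED: call already cancelled" -- the exception in the log above.
     */
    public static <T> void sendIfNotCancelled(StreamObserver<T> observer, T reply) {
        // On the server side, the observer handed to a service implementation
        // is a ServerCallStreamObserver, which exposes isCancelled().
        ServerCallStreamObserver<T> serverObserver = (ServerCallStreamObserver<T>) observer;
        if (serverObserver.isCancelled()) {
            // Client went away (deadline exceeded or explicit cancel):
            // drop the reply instead of propagating an exception.
            return;
        }
        try {
            serverObserver.onNext(reply);
            serverObserver.onCompleted();
        } catch (StatusRuntimeException e) {
            // The call can still be cancelled between the check and onNext();
            // swallow the race so the worker thread is not taken down here.
        }
    }

    private SafeGrpcReply() {}
}
```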
Installation instructions
N/A unrelated
Model Packaging
N/A unrelated
config.properties
No response
Versions
Used the 0.6.0-gpu Docker image.
Repro instructions
N/A
Possible Solution
No response
Top GitHub Comments
@msaroufim @lxning
I can successfully repro it locally with the following setup.

Setup

config.properties: I set min/max workers to 2 (a sketch of such a config is below).
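The poster's actual files were attached as collapsed blocks and are not reproduced here. Purely as an illustration, a config.properties that pins a model to two workers could look like the following; the model name, version, ports, and timeouts are assumptions taken from the logs above, not the poster's real values.

```properties
# Hypothetical config.properties sketch -- not the poster's actual file.
# Model name/version and ports are taken from the logs above; other values are guesses.
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
grpc_inference_port=7070
grpc_management_port=7071
load_models=easyocr.mar
models={\
  "easyocr": {\
    "1.0": {\
      "minWorkers": 2,\
      "maxWorkers": 2,\
      "batchSize": 1,\
      "maxBatchDelay": 100,\
      "responseTimeout": 120\
    }\
  }\
}
```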
Result
It takes TorchServe anywhere from several seconds to several minutes to resolve the issue, after which the simulation output in tab 1 returns to normal; but it is also quite likely that all workers get stuck forever and tab 1 sees a flood of errors.
@hgong-snap confirming we can finally repro, will get back to you with a solution soon