
Worker thread stuck in die state


šŸ› Describe the bug

We encountered a problem similar to https://github.com/pytorch/serve/issues/1531, and it happens quite often.

See logs below. We have two workers (9000, 9001) for this model. After worker 9000 hit an exception, it got stuck in an unknown state: it didn't terminate itself, so no replacement worker could be added automatically, but in the meantime it also stopped receiving incoming traffic. That essentially means we only have one working worker (9001) now.

The problem is that the worker is stuck: it does not (successfully) destroy itself, yet it cannot receive any traffic. It is still counted as an active worker, so TorchServe won't add more workers (because the current worker count is already 2). Normally the worker would die and TorchServe would retry it (e.g. we would find Retry worker: 9001 in 1 seconds. in the log).

If I curl the management API, it still reports that both workers are healthy.
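
For reference, the worker status can also be checked programmatically. The sketch below is not part of the original report; it queries the TorchServe describe-model endpoint, assuming the default management port 8081 and the easyocr model name from the logs:

import json
import urllib.request

# Describe the model via the TorchServe management API (default port 8081 assumed).
with urllib.request.urlopen("http://localhost:8081/models/easyocr") as resp:
    model_info = json.loads(resp.read())

# Each entry lists its workers with an id, status and start time. In this bug the
# stuck worker 9000 still shows up as healthy here even though it takes no traffic.
for worker in model_info[0].get("workers", []):
    print(worker.get("id"), worker.get("status"), worker.get("startTime"))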

Error logs

Worker 9000 died because of an exception. It produced no log output after 2022-08-25 21:21:14.056 PDT. Selected logs:

[WARN ] W-9000-easyocr_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.

io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
    at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
    at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
    at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]

[INFO ] W-9000-easyocr_1.0-stdout MODEL_LOG - Frontend disconnected.
[INFO ] W-9000-easyocr_1.0 ACCESS_LOG - /127.0.0.1:40592 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 109
[INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED

[INFO ] W-9000-easyocr_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-easyocr_1.0-stderr

(Two screenshots of the TorchServe logs, taken 2022-08-25, were attached to the original issue.)

Installation instructions

N/A (unrelated)

Model Packaging

N/A (unrelated)

config.properties

No response

Versions

Used the 0.6.0-gpu Docker image.

Repro instructions

N/A

Possible Solution

No response

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 29 (4 by maintainers)

Top GitHub Comments

2 reactions
hgong-snap commented, Sep 16, 2022

@msaroufim @lxning

I can successfully reproduce it locally with the following setup.

Setup

  • in config.properties, set min/max workers to 2 (a sketch of such a config follows the command list below)
  • start TorchServe locally and wait until it is fully up
  • use a simple Python gRPC client:
import sys
import grpc
import time
import argparse
import inference_pb2
import inference_pb2_grpc
import management_pb2
import management_pb2_grpc
from datetime import datetime
from google.protobuf import empty_pb2


parser = argparse.ArgumentParser()
parser.add_argument("--timeout", required=True)
parser.add_argument("--port", required=True)
parser.add_argument("-n", required=True)


if __name__ == '__main__':
    args, _ = parser.parse_known_args()
    port = args.port
    num_runs = int(args.n)

    with open('image.jpg', 'rb') as f:
        data = f.read()
    input_data = {'data': data}
    request = inference_pb2.PredictionsRequest(model_name="easyocr",
                                               input=input_data)

    channel = grpc.insecure_channel(f'localhost:{port}')
    stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
    for i in range(num_runs):
        try:
            start = datetime.now()
            response = stub.Predictions(request, timeout=float(args.timeout))
            print("request time:", datetime.now() - start)
            print(response)
        except Exception as e:
            print(e)

  • in one shell tab, run the client with a normal timeout (2 s) and a very large number of requests, simulating a consistent stream of normal requests:
python3 torchserve_grpc_client.py --port=17100 --timeout=2 -n=100000000
  • in another shell tab, run the client with a very short timeout (0.02 s), simulating heavy/bad requests that will eventually time out:
python3 torchserve_grpc_client.py --port=17100 --timeout=0.02 -n=100
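
A minimal config.properties in the spirit of the setup above might look like the sketch below; the ports and worker settings are assumptions, since the original file was not shared.

# Sketch only: values assumed to match the repro (gRPC inference port 17100, 2 workers per model)
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
grpc_inference_port=17100
default_workers_per_model=2
load_models=easyocr.mar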

Result

It takes TorchServe anywhere from several seconds to several minutes to recover, after which tab 1's output goes back to normal; but it is also likely that all workers get stuck forever and tab 1 sees a flood of errors.

2 reactions
msaroufim commented, Sep 7, 2022

@hgong-snap confirming we can finally repro, will get back to you with a solution soon
