Worker thread stuck in die state
Describe the bug
We encountered a problem similar to https://github.com/pytorch/serve/issues/1531, and it happens quite often.
See logs below. We have two workers (9000, 9001) for this model. After worker 9000 hit an exception, it got stuck in an unknown state: it didn't terminate itself, so no replacement worker could be added automatically, but in the meantime it also stopped receiving incoming traffic, which essentially means we only have one working worker (9001) now.
The problem is that this worker is stuck: it does not destruct itself (successfully) and it can't receive any traffic, yet it is still counted as an active worker, so TorchServe won't add more workers (because the current worker count is already 2). Normally the worker would die and TorchServe would retry it (e.g. you would find Retry worker: 9001 in 1 seconds. in the log).
If I curl the management API, it still shows that both workers are healthy.
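For reference, this is the kind of describe-model call meant here, assuming the default management port 8081 and the easyocr model name from the logs; in the stuck state it still lists both workers as healthy.

```bash
# Describe the model via the TorchServe management API (default port 8081).
# In the stuck state this still lists both workers, including 9000,
# even though 9000 no longer serves any traffic.
curl http://localhost:8081/models/easyocr
```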
Error logs
worker-9000 died because of an exception and did not log anything after 2022-08-25 21:21:14.056 PDT. Selected logs:
[WARN ] W-9000-easyocr_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker thread exception.
io.grpc.StatusRuntimeException: CANCELLED: call already cancelled
	at io.grpc.Status.asRuntimeException(Status.java:524) ~[model-server.jar:?]
	at io.grpc.stub.ServerCalls$ServerCallStreamObserverImpl.onNext(ServerCalls.java:335) ~[model-server.jar:?]
	at org.pytorch.serve.job.GRPCJob.response(GRPCJob.java:66) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.BatchAggregator.sendResponse(BatchAggregator.java:74) ~[model-server.jar:?]
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:195) ~[model-server.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
[INFO ] W-9000-easyocr_1.0-stdout MODEL_LOG - Frontend disconnected.
[INFO ] W-9000-easyocr_1.0 ACCESS_LOG - /127.0.0.1:40592 "gRPC org.pytorch.serve.grpc.inference.InferenceAPIsService/Predictions HTTP/2.0" 13 109
[INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_MODEL_LOADED
[INFO ] W-9000-easyocr_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-easyocr_1.0-stderr
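The CANCELLED: call already cancelled status comes from grpc-java: the frontend tries to write a response on a server call that the client has already cancelled (for example after a deadline), and ServerCallStreamObserverImpl.onNext throws. As a rough illustration only (not TorchServe's actual code or fix), a guard of roughly this shape avoids killing the worker thread on that send path; SafeGrpcReply and sendIfNotCancelled are made-up names.

```java
import io.grpc.StatusRuntimeException;
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;

/** Hypothetical helper; SafeGrpcReply is not a TorchServe class. */
public final class SafeGrpcReply {

    /**
     * Sends a reply only if the client has not already cancelled the call.
     * Without a guard like this, onNext() throws
     * "CANCELLED: call already cancelled" -- the exception in the log above.
     */
    public static <T> void sendIfNotCancelled(StreamObserver<T> observer, T reply) {
        // On the server side, the observer handed to a service implementation
        // is a ServerCallStreamObserver, which exposes isCancelled().
        ServerCallStreamObserver<T> serverObserver = (ServerCallStreamObserver<T>) observer;
        if (serverObserver.isCancelled()) {
            // Client went away (deadline exceeded or explicit cancel):
            // drop the reply instead of propagating an exception.
            return;
        }
        try {
            serverObserver.onNext(reply);
            serverObserver.onCompleted();
        } catch (StatusRuntimeException e) {
            // The call can still be cancelled between the check and onNext();
            // swallow the race so the worker thread is not taken down here.
        }
    }

    private SafeGrpcReply() {}
}
```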
Installation instructions
N/A unrelated
Model Packaging
N/A unrelated
config.properties
No response
Versions
Used the 0.6.0-gpu Docker image.
Repro instructions
N/A
Possible Solution
No response
Top GitHub Comments
@msaroufim @lxning
I can successfully repro it locally with the following setup.

Setup

config.properties: I set min/max workers to 2 (a sketch of such a config is below).
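The poster's actual files were attached as collapsed blocks and are not reproduced here. Purely as an illustration, a config.properties that pins a model to two workers could look like the following; the model name, version, ports, and timeouts are assumptions taken from the logs above, not the poster's real values.

```properties
# Hypothetical config.properties sketch -- not the poster's actual file.
# Model name/version and ports are taken from the logs above; other values are guesses.
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
grpc_inference_port=7070
grpc_management_port=7071
load_models=easyocr.mar
models={\
  "easyocr": {\
    "1.0": {\
      "minWorkers": 2,\
      "maxWorkers": 2,\
      "batchSize": 1,\
      "maxBatchDelay": 100,\
      "responseTimeout": 120\
    }\
  }\
}
```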
Result
It takes TorchServe anywhere from several seconds to several minutes to resolve the issue, after which the simulation output in tab 1 returns to normal; but it is also quite likely that all workers get stuck forever and tab 1 sees a flood of errors.
@hgong-snap confirming we can finally repro, will get back to you with a solution soon