Backend worker monitoring thread interrupted or backend worker process died.
Context
I’m testing torchserve using the tutorial provided in the link: https://github.com/pytorch/serve/tree/master/examples/image_classifier/mnist
It works perfectly fine, but when I add a new model.py file:
from mnist import Net

class ImageClassifier(Net):
    def __init__(self):
        super(ImageClassifier, self).__init__()
and change the archive command to:
torch-model-archiver --model-name mnist --version 1.0 --model-file examples/image_classifier/mnist/model.py --serialized-file examples/image_classifier/mnist/mnist_cnn.pt --handler examples/image_classifier/mnist/mnist_handler.py
after executing the torchserve --start command, it returns:
Enable metrics API: true
2020-09-07 10:48:40,905 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: mnist.mar
2020-09-07 10:48:41,074 [INFO ] main org.pytorch.serve.archive.ModelArchive - eTag 4b1b109d5a834339ab1493bd98bc7d7a
2020-09-07 10:48:41,089 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model mnist
2020-09-07 10:48:41,090 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model mnist
2020-09-07 10:48:41,090 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model mnist loaded.
2020-09-07 10:48:41,090 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: mnist, count: 1
2020-09-07 10:48:41,111 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2020-09-07 10:48:41,234 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2020-09-07 10:48:41,234 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2020-09-07 10:48:41,238 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2020-09-07 10:48:41,238 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2020-09-07 10:48:41,240 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2020-09-07 10:48:41,276 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000
2020-09-07 10:48:41,277 [WARN ] pool-2-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2020-09-07 10:48:41,277 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]9643
2020-09-07 10:48:41,277 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-09-07 10:48:41,278 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.9
2020-09-07 10:48:41,278 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-mnist_1.0 State change null -> WORKER_STARTED
2020-09-07 10:48:41,284 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2020-09-07 10:48:41,304 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000.
2020-09-07 10:48:41,365 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,366 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:11.976577758789062|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,366 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:74.30361938476562|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,366 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:86.1|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,367 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:2359.8125|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,367 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:3004.60546875|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,367 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:59.7|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,810 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.
2020-09-07 10:48:41,810 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 176, in <module>
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     worker.run_server()
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 148, in run_server
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.handle_connection(cl_socket)
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 112, in handle_connection
2020-09-07 10:48:41,812 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service, result, code = self.load_model(msg)
2020-09-07 10:48:41,812 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 85, in load_model
2020-09-07 10:48:41,812 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2020-09-07 10:48:41,812 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size)
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_loader.py", line 117, in load
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     model_service.initialize(service.context)
2020-09-07 10:48:41,813 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/torch_handler/base_handler.py", line 50, in initialize
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.model = self._load_pickled_model(model_dir, model_file, model_pt_path)
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/torch_handler/base_handler.py", line 74, in _load_pickled_model
2020-09-07 10:48:41,813 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
    at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432)
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
- torchserve version: 0.2.0
- torch version: 1.6.0
- java version: 11.0.8
- Operating System and version: Ubuntu 18.04.5
I think this is because the model cannot be loaded. How do I fix this? As far as I can tell, the AlexNet and ResNet tutorials use a similar method.
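A quick way to check that hypothesis (a sketch, assuming the default ZIP-based `.mar` format produced by torch-model-archiver) is to list the archive's contents and see whether the file that defines `Net` (mnist.py) was actually packaged:

```bash
# The default .mar format is a ZIP archive, so unzip can list what was bundled.
# Adjust the path to wherever the archive was written (e.g. model_store/mnist.mar).
unzip -l mnist.mar
```

If mnist.py does not appear in the listing, the `from mnist import Net` statement in model.py has nothing to resolve once the worker extracts the archive.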
Top GitHub Comments
@hoangminhq5310
Since you have already installed the `torchvision` package through `pip` or `conda`, the resnet.py file is already available on the Python path. In the case of the mnist example, the model network file is not on the PYTHONPATH by default and hence needs to be made available to TorchServe. TorchServe extracts the model archive (.mar file) into a temporary directory and adds that directory to the PYTHONPATH before creating the model worker. Thus, supplying the model's architecture file in your modified mnist example resolved the problem.

That would be because of the following import statement in the resnet.py file from the torchvision package:
from .utils import load_state_dict_from_url
This tries to import `load_state_dict_from_url` from the `utils` module, and it expects that `utils` module to be available in the current directory, i.e. the model's temporary directory.

Closing, as the query has been answered and due to inactivity.
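For anyone hitting the same error: based on the explanation above, one way to supply the architecture file is to bundle mnist.py (the file that defines `Net`, which model.py imports) into the archive with `--extra-files`. This is a sketch of the adjusted command, assuming the network definition lives at examples/image_classifier/mnist/mnist.py as in the tutorial; the exact fix the original poster applied is not shown in the thread:

```bash
# Same command as in the question, plus --extra-files so that mnist.py ends up
# next to model.py in the extracted archive and `from mnist import Net` resolves.
torch-model-archiver --model-name mnist --version 1.0 \
  --model-file examples/image_classifier/mnist/model.py \
  --serialized-file examples/image_classifier/mnist/mnist_cnn.pt \
  --handler examples/image_classifier/mnist/mnist_handler.py \
  --extra-files examples/image_classifier/mnist/mnist.py
```

After re-archiving, restart TorchServe (or re-register the model) so the new .mar is picked up.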