
Backend worker monitoring thread interrupted or backend worker process died.

See original GitHub issue

Context

I’m testing TorchServe using the MNIST tutorial provided at this link: https://github.com/pytorch/serve/tree/master/examples/image_classifier/mnist

It works perfectly fine, but when I add a new model.py file:

from mnist import Net


class ImageClassifier(Net):
    def __init__(self):
        super(ImageClassifier, self).__init__()

and change the archive command to:

torch-model-archiver --model-name mnist --version 1.0 --model-file examples/image_classifier/mnist/model.py --serialized-file examples/image_classifier/mnist/mnist_cnn.pt --handler examples/image_classifier/mnist/mnist_handler.py

after executing the torchserve --start command, it returns:

Enable metrics API: true
2020-09-07 10:48:40,905 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: mnist.mar
2020-09-07 10:48:41,074 [INFO ] main org.pytorch.serve.archive.ModelArchive - eTag 4b1b109d5a834339ab1493bd98bc7d7a
2020-09-07 10:48:41,089 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model mnist
2020-09-07 10:48:41,090 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model mnist
2020-09-07 10:48:41,090 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model mnist loaded.
2020-09-07 10:48:41,090 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: mnist, count: 1
2020-09-07 10:48:41,111 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2020-09-07 10:48:41,234 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2020-09-07 10:48:41,234 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2020-09-07 10:48:41,238 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2020-09-07 10:48:41,238 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2020-09-07 10:48:41,240 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2020-09-07 10:48:41,276 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /tmp/.ts.sock.9000
2020-09-07 10:48:41,277 [WARN ] pool-2-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2020-09-07 10:48:41,277 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]9643
2020-09-07 10:48:41,277 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-09-07 10:48:41,278 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.9
2020-09-07 10:48:41,278 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-mnist_1.0 State change null -> WORKER_STARTED
2020-09-07 10:48:41,284 [INFO ] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2020-09-07 10:48:41,304 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9000.
2020-09-07 10:48:41,365 [INFO ] pool-2-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,366 [INFO ] pool-2-thread-1 TS_METRICS - DiskAvailable.Gigabytes:11.976577758789062|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,366 [INFO ] pool-2-thread-1 TS_METRICS - DiskUsage.Gigabytes:74.30361938476562|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,366 [INFO ] pool-2-thread-1 TS_METRICS - DiskUtilization.Percent:86.1|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,367 [INFO ] pool-2-thread-1 TS_METRICS - MemoryAvailable.Megabytes:2359.8125|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,367 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUsed.Megabytes:3004.60546875|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,367 [INFO ] pool-2-thread-1 TS_METRICS - MemoryUtilization.Percent:59.7|#Level:Host|#hostname:hoangminhq-GL552JX,timestamp:1599450521
2020-09-07 10:48:41,810 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.
2020-09-07 10:48:41,810 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 176, in <module>
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - worker.run_server()
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 148, in run_server
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self.handle_connection(cl_socket)
2020-09-07 10:48:41,811 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 112, in handle_connection
2020-09-07 10:48:41,812 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service, result, code = self.load_model(msg)
2020-09-07 10:48:41,812 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_service_worker.py", line 85, in load_model
2020-09-07 10:48:41,812 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2020-09-07 10:48:41,812 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - service = model_loader.load(model_name, model_dir, handler, gpu, batch_size)
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/model_loader.py", line 117, in load
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - model_service.initialize(service.context)
2020-09-07 10:48:41,813 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/torch_handler/base_handler.py", line 50, in initialize
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - self.model = self._load_pickled_model(model_dir, model_file, model_pt_path)
2020-09-07 10:48:41,813 [INFO ] W-9000-mnist_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - File "/home/hoangminhq/.local/lib/python3.6/site-packages/ts/torch_handler/base_handler.py", line 74, in _load_pickled_model
2020-09-07 10:48:41,813 [DEBUG] W-9000-mnist_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
    at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432)
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

  • torchserve version: 0.2.0
  • torch version: 1.6.0
  • java version: 11.0.8
  • Operating System and version: Ubuntu 18.04.5

I think this is because the model cannot be loaded. How do I fix this? As far as I can tell, the AlexNet and ResNet tutorials use a similar method.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
harshbafna commented, Sep 9, 2020

@hoangminhq5310

Then the resnet.py from torchvision should also be included in the --extra-files right?

Since you have already installed the torchvision package through pip or conda, the resnet.py file is already available on the Python path. In the case of the mnist example, the model's network file is not available on the PYTHONPATH by default, and hence needs to be made available to TorchServe.

TorchServe extracts the model archive (.mar file) into a temporary directory and adds that directory to the PYTHONPATH before creating the model worker. Thus, supplying the model's architecture file in the case of your modified mnist example resolved the problem.
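
For reference, a minimal sketch of the adjusted archive command for the modified mnist example, assuming the Net class imported in your new model.py is defined in examples/image_classifier/mnist/mnist.py in your checkout (adjust the path if yours differs); the network file is supplied via --extra-files so it lands in the model's temp directory:

torch-model-archiver --model-name mnist --version 1.0 --model-file examples/image_classifier/mnist/model.py --serialized-file examples/image_classifier/mnist/mnist_cnn.pt --handler examples/image_classifier/mnist/mnist_handler.py --extra-files examples/image_classifier/mnist/mnist.py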

I tried to put resnet.py (from torchvision), model.py (from the resnet-18 example) and index_to_name.json in the same directory, then changed model.py as follows:

from resnet import ResNet, BasicBlock


class ImageClassifier(ResNet):
    def __init__(self):
        super(ImageClassifier, self).__init__(BasicBlock, [2, 2, 2, 2])

it’s not working, even when I use this command:

torch-model-archiver --model-name resnet-18 --version 1.0 --model-file model.py --serialized-file resnet18-5c106cde.pth --handler image_classifier --extra-files index_to_name.json,resnet.py

it’s still not working.

That would be because of the following import statement in the resnet.py file from the torchvision package:

from .utils import load_state_dict_from_url

This statement tries to import load_state_dict_from_url from a utils module, and because it is a relative import, it expects that utils module to be available alongside resnet.py in the current package, i.e. the model's temp directory.
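
As a side note (not from the original thread): one way to avoid copying resnet.py and its relative imports altogether is to subclass the ResNet that ships with the installed torchvision package, which is already on the PYTHONPATH. A minimal sketch, assuming a torchvision version that exposes ResNet and BasicBlock under torchvision.models.resnet:

# model.py -- reuse the installed torchvision implementation instead of a local copy,
# so torchvision's internal "from .utils import load_state_dict_from_url" keeps working.
from torchvision.models.resnet import ResNet, BasicBlock


class ImageClassifier(ResNet):
    def __init__(self):
        # BasicBlock with layer counts [2, 2, 2, 2] matches the resnet-18 architecture
        super(ImageClassifier, self).__init__(BasicBlock, [2, 2, 2, 2])

With a model.py like this, resnet.py would no longer need to be passed via --extra-files; only index_to_name.json would remain there.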

0 reactions
harshbafna commented, Sep 19, 2020

Closing as the query has been answered and due to inactivity.

