Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you're still stuck at the end, we're happy to hop on a call to see how we can help out.

Model not loaded - MODEL_LOG - what(): std::bad_alloc - Torchserve version: 0.5.2

See original GitHub issue

🐛 Describe the bug

TorchServe:
Name: torch-model-archiver
Version: 0.6.0

The command to create the MAR file:
torch-model-archiver --model-name bodypartrec --version 1.0 --serialized-file model.pt --handler model_handler.py --requirements-file requir/requirements.txt
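The handler referenced by --handler is not included in the issue. For context, a minimal custom handler for an image-classification model might look like the sketch below; the class name and the preprocessing steps are assumptions, not the author's actual model_handler.py:

# Hypothetical sketch of a minimal model_handler.py for an image classifier.
# The issue does not show the real handler, so everything below is illustrative.
import io

import torch
from PIL import Image
from torchvision import transforms
from ts.torch_handler.base_handler import BaseHandler


class BodyPartHandler(BaseHandler):
    """Classifies images with the model packaged in the MAR file."""

    image_processing = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    def preprocess(self, data):
        # Each request item carries the raw image bytes under "data" or "body".
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(payload)).convert("RGB")
            images.append(self.image_processing(image))
        return torch.stack(images)

    def postprocess(self, inference_output):
        # Return one predicted class index per request item.
        return inference_output.argmax(dim=1).tolist()

TorchServe calls preprocess, the inherited inference step, and postprocess for each request batch; the packaged model.pt is loaded by BaseHandler.initialize when the worker starts.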

KServe InferenceService:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  namespace: "kserve-test"
  name: "my-model"
spec:
  predictor:
    serviceAccountName: kserve-sa
    pytorch:
      protocolVersion: "v2"
      storageUri: "s3://…/kserve/"
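Since protocolVersion is "v2", the predictor exposes the KServe v2 (Open Inference Protocol) REST routes once the model loads. The paths below are illustrative; the host and routing depend on the cluster's ingress setup:

GET  /v2/health/ready
GET  /v2/models/my-model/ready
POST /v2/models/my-model/infer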

The error when TorchServe is started with KServe is as shown in the logs below: basically, "what(): std::bad_alloc" is thrown and the worker dies. There are no clues on how to proceed. I saw that this issue is marked as a bug, but I don't know in what case it is triggered. Any help is really appreciated!
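Not part of the original report, but a quick way to check whether the MAR itself is at fault is to load it with a standalone TorchServe outside KServe (a later comment confirms it loads fine that way). A minimal sketch, assuming the MAR sits in a local model_store/ directory, default TorchServe ports, and that the model's Python dependencies are already installed:

# start a local TorchServe without config snapshots and register the model
torchserve --start --ncs --model-store model_store --models bodypartrec=bodypartrec.mar
# check whether the worker comes up instead of dying with std::bad_alloc
curl http://127.0.0.1:8081/models/bodypartrec
torchserve --stop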

Error logs

Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init)
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
/usr/local/lib/python3.8/dist-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via pip install 'ray[default]'. Please update your install command.
  warnings.warn(
2022-06-01T13:52:17,164 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager…
2022-06-01T13:52:17,276 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.5.2
TS Home: /usr/local/lib/python3.8/dist-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 24
Max heap size: 512 M
Python executable: /usr/bin/python
Config file: /mnt/models/config/config.properties
Inference address: http://0.0.0.0:8085
Management address: http://0.0.0.0:8085
Metrics address: http://0.0.0.0:8082
Model Store: /mnt/models/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 4
Netty client threads: 0
Default workers per model: 24
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /mnt/models/model-store
Model config: N/A
2022-06-01T13:52:17,283 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin…
2022-06-01T13:52:17,321 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {"name":"startup.cfg","modelCount":1,"models":{"my-model":{"1.0":{"defaultVersion":true,"marName":"bodypartrec.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"responseTimeout":120,"maxBatchDelay":50}}}}
2022-06-01T13:52:17,328 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot startup.cfg
2022-06-01T13:52:17,329 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot startup.cfg validated successfully
[I 220601 13:52:17 main:75] Wrapper : Model names ['my-model'], inference address http://0.0.0.0:8085, management address http://0.0.0.0:8085, model store /mnt/models/model-store
[I 220601 13:52:17 TorchserveModel:54] kfmodel Predict URL set to 0.0.0.0:8085
[I 220601 13:52:17 TorchserveModel:56] kfmodel Explain URL set to 0.0.0.0:8085
[I 220601 13:52:17 TSModelRepository:30] TSModelRepo is initialized
[I 220601 13:52:17 kfserver:150] Registering model: my-model
[I 220601 13:52:17 kfserver:120] Setting asyncio max_workers as 28
[I 220601 13:52:17 kfserver:127] Listening on port 8080
[I 220601 13:52:17 kfserver:129] Will fork 1 workers
2022-06-01T13:52:17,671 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model my-model
2022-06-01T13:52:17,671 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my-model
2022-06-01T13:52:50,088 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my-model
2022-06-01T13:52:50,088 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model my-model loaded.
2022-06-01T13:52:50,088 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: my-model, count: 1
2022-06-01T13:52:50,096 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-06-01T13:52:50,097 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-06-01T13:52:50,158 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8085
2022-06-01T13:52:50,159 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-06-01T13:52:50,160 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2022-06-01T13:52:50,363 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2022-06-01T13:52:50,415 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:40.0|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,417 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:94.23827743530273|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,417 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:774.5930213928223|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.2|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:23591.0078125|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:8041.05078125|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,419 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:26.5|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,766 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]50
2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started.
2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0
2022-06-01T13:52:50,768 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change null -> WORKER_STARTED
2022-06-01T13:52:50,774 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:52:50,784 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2022-06-01T13:52:50,787 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091570787
2022-06-01T13:52:50,819 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1
2022-06-01T13:52:50,888 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc'
2022-06-01T13:52:50,889 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc
2022-06-01T13:52:58,066 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr
2022-06-01T13:52:58,066 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout
2022-06-01T13:52:58,067 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-06-01T13:52:58,080 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-06-01T13:52:58,080 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
    at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
2022-06-01T13:52:58,090 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died.
2022-06-01T13:52:58,091 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-06-01T13:52:58,091 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr
2022-06-01T13:52:58,091 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout
2022-06-01T13:52:58,092 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.

[The same sequence then repeats in the log: the worker is restarted (PIDs 107, 135, 163, 191, 219), logs "model_name: my-model, batchSize: 1", terminates with 'std::bad_alloc', is reported as "Load model failed: my-model, error: Worker died.", and TorchServe retries with backoff intervals of 1, 2, 3, 5, and 8 seconds.]

Installation instructions

No, TorchServe is running inside KServe.

Model Packaging

step by step with torch-model-archiver

config.properties

install_py_dep_per_model=true
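For reference, the settings echoed in the TorchServe startup log above correspond to a config.properties along these lines; this is reconstructed from the log output and is not the author's actual file:

inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8085
metrics_address=http://0.0.0.0:8082
model_store=/mnt/models/model-store
install_py_dep_per_model=true
enable_metrics_api=true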

Versions

Name: torch Version: 1.9.0+cu111

Name: torch-model-archiver Version: 0.6.0

Repro instructions

kubectl apply -f bodypart.yaml -n kserve-test

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  namespace: "kserve-test"
  name: "my-model"
spec:
  predictor:
    serviceAccountName: kserve-sa
    pytorch:
      protocolVersion: "v2"
      storageUri: "s3://…/kserve/"
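After applying the manifest, the InferenceService status and the predictor logs quoted above can be pulled with standard kubectl commands; the pod name placeholder below is illustrative, not from the issue:

kubectl get inferenceservice my-model -n kserve-test
kubectl get pods -n kserve-test
kubectl logs <my-model-predictor-pod> -n kserve-test -c kserve-container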

Possible Solution

No response

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 17 (3 by maintainers)

Top GitHub Comments

1 reaction
festinais commented, Jun 8, 2022

Thank you! It works now 😃 It was my mistake, I was referring to kserve/ instead of pytorch/…
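For anyone hitting the same symptom: the fix was pointing storageUri at the right prefix in the bucket (pytorch/ rather than kserve/). Judging from the paths in the startup log above (config file under /mnt/models/config, model store under /mnt/models/model-store), whatever directory storageUri points at needs to contain roughly this layout; the directory names are taken from the log, the rest is an assumption:

<storageUri>/
  config/
    config.properties
  model-store/
    bodypartrec.mar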

1 reaction
festinais commented, Jun 2, 2022

No, I couldn't reproduce it without KServe. I tried with different Python envs (3.8 to 3.10) and everything worked fine. The error always happened with KServe.

Read more comments on GitHub >

Top Results From Across the Web

12. Running TorchServe - PyTorch/Serve master ...
You can use TorchServe to serve a prediction endpoint for an object detection and identification model that intakes images, then returns predictions.
Read more >
python - loading model failed in torchserving - Stack Overflow
i am using the model from kaggle. I presume you got the model from https://www.kaggle.com/pytorch/vgg16. I think you are loading the model ......
Read more >
How to Serve PyTorch Models with TorchServe - YouTube
Hamid Shojanazeri is a Partner Engineer at PyTorch, here to demonstrate the basics of using TorchServe. As the preferred model serving ...
Read more >
BERT TorchServe Tutorial - AWS Neuron documentation
You are not reading the most recent version of this documentation. v2.4.0 is the ... loaded from {model_dir}') self.model.to(self.device) self.model.eval() ...
Read more >
Deploying PyTorch models for inference at scale using ...
AWS is excited to share the experimental release of TorchServe, ... version production models for A/B testing, load and unload models ...
Read more >
