Model not loaded - MODEL_LOG - what(): std::bad_alloc - Torchserve version: 0.5.2
Describe the bug
TorchServe: torch-model-archiver version 0.6.0. The command used to create the MAR file:
torch-model-archiver --model-name bodypartrec --version 1.0 --serialized-file model.pt --handler model_handler.py --requirements-file requir/requirements.txt
KServe InferenceService:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  namespace: "kserve-test"
  name: "my-model"
spec:
  predictor:
    serviceAccountName: kserve-sa
    pytorch:
      protocolVersion: "v2"
      storageUri: "s3://…/kserve/"
The error when TorchServe is started under KServe is shown in the logs below. In short, "what(): std::bad_alloc" is thrown and the worker dies, with no clues on how to proceed. I saw that this issue is marked as a bug, but I don't know under what conditions it is triggered. Any help is really appreciated!
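Since std::bad_alloc means a C++ allocation inside the worker process failed, one early check is whether the predictor pod simply has enough memory to load the model. The fragment below is only a hedged sketch of raising memory requests/limits on the PyTorch predictor to rule that out; the values are placeholders, and as the comments at the end of this issue show, the eventual fix here was the storageUri path, not memory:
# Sketch only: add resource requests/limits under the pytorch predictor (placeholder values).
spec:
  predictor:
    pytorch:
      protocolVersion: "v2"
      storageUri: "s3://…/kserve/"
      resources:
        requests:
          memory: "4Gi"
        limits:
          memory: "8Gi"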
Error logs
Defaulted container ākserve-containerā out of: kserve-container, queue-proxy, storage-initializer (init)
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
/usr/local/lib/python3.8/dist-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via pip install 'ray[default]'
. Please update your install command.
warnings.warn(
2022-06-01T13:52:17,164 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager…
2022-06-01T13:52:17,276 [INFO ] main org.pytorch.serve.ModelServer -
Torchserve version: 0.5.2
TS Home: /usr/local/lib/python3.8/dist-packages
Current directory: /home/model-server
Temp directory: /home/model-server/tmp
Number of GPUs: 0
Number of CPUs: 24
Max heap size: 512 M
Python executable: /usr/bin/python
Config file: /mnt/models/config/config.properties
Inference address: http://0.0.0.0:8085
Management address: http://0.0.0.0:8085
Metrics address: http://0.0.0.0:8082
Model Store: /mnt/models/model-store
Initial Models: N/A
Log dir: /home/model-server/logs
Metrics dir: /home/model-server/logs
Netty threads: 4
Netty client threads: 0
Default workers per model: 24
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /mnt/models/model-store
Model config: N/A
2022-06-01T13:52:17,283 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin…
2022-06-01T13:52:17,321 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Started restoring models from snapshot {"name":"startup.cfg","modelCount":1,"models":{"my-model":{"1.0":{"defaultVersion":true,"marName":"bodypartrec.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"responseTimeout":120, "maxBatchDelay":50}}}}
2022-06-01T13:52:17,328 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Validating snapshot startup.cfg
2022-06-01T13:52:17,329 [INFO ] main org.pytorch.serve.snapshot.SnapshotManager - Snapshot startup.cfg validated successfully
[I 220601 13:52:17 main:75] Wrapper : Model names ['my-model'], inference address http://0.0.0.0:8085, management address http://0.0.0.0:8085, model store /mnt/models/model-store
[I 220601 13:52:17 TorchserveModel:54] kfmodel Predict URL set to 0.0.0.0:8085
[I 220601 13:52:17 TorchserveModel:56] kfmodel Explain URL set to 0.0.0.0:8085
[I 220601 13:52:17 TSModelRepository:30] TSModelRepo is initialized
[I 220601 13:52:17 kfserver:150] Registering model: my-model
[I 220601 13:52:17 kfserver:120] Setting asyncio max_workers as 28
[I 220601 13:52:17 kfserver:127] Listening on port 8080
[I 220601 13:52:17 kfserver:129] Will fork 1 workers
2022-06-01T13:52:17,671 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model my-model
2022-06-01T13:52:17,671 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my-model
2022-06-01T13:52:50,088 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model my-model
2022-06-01T13:52:50,088 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model my-model loaded.
2022-06-01T13:52:50,088 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: my-model, count: 1
2022-06-01T13:52:50,096 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-06-01T13:52:50,097 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-06-01T13:52:50,158 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8085
2022-06-01T13:52:50,159 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-06-01T13:52:50,160 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Model server started.
2022-06-01T13:52:50,363 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2022-06-01T13:52:50,415 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:40.0|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,417 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:94.23827743530273|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,417 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:774.5930213928223|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:89.2|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:23591.0078125|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,418 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:8041.05078125|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,419 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:26.5|#Level:Host|#hostname:my-model-predictor-default-00001-deployment-64f4bc9566-8x7ps,timestamp:1654091570
2022-06-01T13:52:50,766 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]50
2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started.
2022-06-01T13:52:50,767 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0
2022-06-01T13:52:50,768 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change null -> WORKER_STARTED
2022-06-01T13:52:50,774 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:52:50,784 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2022-06-01T13:52:50,787 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091570787
2022-06-01T13:52:50,819 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1
2022-06-01T13:52:50,888 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc'
2022-06-01T13:52:50,889 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc
2022-06-01T13:52:58,066 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr
2022-06-01T13:52:58,066 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout
2022-06-01T13:52:58,067 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-06-01T13:52:58,080 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-06-01T13:52:58,080 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-06-01T13:52:58,090 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died.
2022-06-01T13:52:58,091 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-06-01T13:52:58,091 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr
2022-06-01T13:52:58,091 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout
2022-06-01T13:52:58,092 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-06-01T13:52:59,092 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-06-01T13:52:59,812 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]107
2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started.
2022-06-01T13:52:59,813 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0
2022-06-01T13:52:59,813 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:52:59,816 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091579815
2022-06-01T13:52:59,816 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2022-06-01T13:52:59,836 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1
2022-06-01T13:52:59,912 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc'
2022-06-01T13:52:59,913 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc
2022-06-01T13:53:03,968 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr
2022-06-01T13:53:03,969 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout
2022-06-01T13:53:03,969 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-06-01T13:53:03,972 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-06-01T13:53:03,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-06-01T13:53:03,973 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died.
2022-06-01T13:53:03,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-06-01T13:53:03,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr
2022-06-01T13:53:03,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout
2022-06-01T13:53:03,974 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2022-06-01T13:53:04,974 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-06-01T13:53:05,584 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:53:05,585 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]135
2022-06-01T13:53:05,585 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started.
2022-06-01T13:53:05,585 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2022-06-01T13:53:05,585 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0
2022-06-01T13:53:05,586 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:53:05,588 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091585588
2022-06-01T13:53:05,588 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2022-06-01T13:53:05,608 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1
2022-06-01T13:53:05,680 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc'
2022-06-01T13:53:05,681 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc
2022-06-01T13:53:09,625 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr
2022-06-01T13:53:09,625 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout
2022-06-01T13:53:09,625 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-06-01T13:53:09,632 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-06-01T13:53:09,632 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-06-01T13:53:09,633 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died.
2022-06-01T13:53:09,633 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-06-01T13:53:09,633 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr
2022-06-01T13:53:09,633 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout
2022-06-01T13:53:09,633 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 2 seconds.
2022-06-01T13:53:11,634 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-06-01T13:53:12,255 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:53:12,256 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]163
2022-06-01T13:53:12,256 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started.
2022-06-01T13:53:12,256 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0
2022-06-01T13:53:12,256 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2022-06-01T13:53:12,257 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:53:12,259 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091592259
2022-06-01T13:53:12,259 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2022-06-01T13:53:12,284 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1
2022-06-01T13:53:12,361 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc'
2022-06-01T13:53:12,362 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc
2022-06-01T13:53:16,966 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr
2022-06-01T13:53:16,966 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout
2022-06-01T13:53:16,966 [INFO ] epollEventLoopGroup-5-4 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-06-01T13:53:16,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-06-01T13:53:16,973 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-06-01T13:53:16,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died.
2022-06-01T13:53:16,974 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-06-01T13:53:16,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr
2022-06-01T13:53:16,974 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout
2022-06-01T13:53:16,974 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 3 seconds.
2022-06-01T13:53:19,975 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-06-01T13:53:20,544 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]191
2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started.
2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0
2022-06-01T13:53:20,545 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2022-06-01T13:53:20,545 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:53:20,547 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091600547
2022-06-01T13:53:20,547 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2022-06-01T13:53:20,569 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1
2022-06-01T13:53:20,644 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc'
2022-06-01T13:53:20,645 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc
2022-06-01T13:53:25,278 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr
2022-06-01T13:53:25,278 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout
2022-06-01T13:53:25,279 [INFO ] epollEventLoopGroup-5-5 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-06-01T13:53:25,286 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-06-01T13:53:25,286 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-06-01T13:53:25,287 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died.
2022-06-01T13:53:25,287 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-06-01T13:53:25,287 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr
2022-06-01T13:53:25,287 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout
2022-06-01T13:53:25,288 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 5 seconds.
2022-06-01T13:53:30,288 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/usr/bin/python, /usr/local/lib/python3.8/dist-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /home/model-server/tmp/.ts.sock.9000]
2022-06-01T13:53:31,052 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Listening on port: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - [PID]219
2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Torch worker started.
2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Python runtime: 3.8.0
2022-06-01T13:53:31,053 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2022-06-01T13:53:31,053 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2022-06-01T13:53:31,055 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req. to backend at: 1654091611055
2022-06-01T13:53:31,055 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2022-06-01T13:53:31,056 [INFO ] W-9000-my-model_1.0-stdout MODEL_LOG - model_name: my-model, batchSize: 1
2022-06-01T13:53:31,130 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - terminate called after throwing an instance of 'std::bad_alloc'
2022-06-01T13:53:31,131 [WARN ] W-9000-my-model_1.0-stderr MODEL_LOG - what(): std::bad_alloc
2022-06-01T13:53:35,936 [INFO ] W-9000-my-model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stdout
2022-06-01T13:53:35,936 [INFO ] W-9000-my-model_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-my-model_1.0-stderr
2022-06-01T13:53:35,936 [INFO ] epollEventLoopGroup-5-6 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2022-06-01T13:53:35,941 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2022-06-01T13:53:35,941 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException: null
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056) ~[?:?]
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133) ~[?:?]
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432) ~[?:?]
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:189) [model-server.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
2022-06-01T13:53:35,941 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: my-model, error: Worker died.
2022-06-01T13:53:35,941 [DEBUG] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-my-model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2022-06-01T13:53:35,942 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stderr
2022-06-01T13:53:35,942 [WARN ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-my-model_1.0-stdout
2022-06-01T13:53:35,942 [INFO ] W-9000-my-model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 8 seconds.
Installation instructions
No, TorchServe is running inside KServe.
Model Packaging
step by step with torch-model-archiver
config.properties
install_py_dep_per_model=true
Versions
Name: torch Version: 1.9.0+cu111
Name: torch-model-archiver Version: 0.6.0
Repro instructions
kubectl apply -f bodypart.yaml -n kserve-test
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  namespace: "kserve-test"
  name: "my-model"
spec:
  predictor:
    serviceAccountName: kserve-sa
    pytorch:
      protocolVersion: "v2"
      storageUri: "s3://…/kserve/"
Possible Solution
No response
Top GitHub Comments
Thank you! It works now. It was my mistake: I was referring to kserve/ instead of pytorch/…
No, I couldn't reproduce it without KServe. I tried different Python environments (3.8–3.10) and everything worked fine. The error only happened with KServe.
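If "kserve/ instead of pytorch/" refers to the S3 prefix in storageUri, the fix amounts to pointing storageUri at the prefix that actually contains the TorchServe layout: a config/config.properties plus a model-store/ directory holding the MAR file, which is what the storage-initializer places under /mnt/models in the startup log above. A hypothetical corrected manifest follows; the bucket and prefix are placeholders, not values taken from the issue:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  namespace: "kserve-test"
  name: "my-model"
spec:
  predictor:
    serviceAccountName: kserve-sa
    pytorch:
      protocolVersion: "v2"
      # Placeholder prefix: it should contain config/config.properties and
      # model-store/bodypartrec.mar, which end up under /mnt/models in the pod.
      storageUri: "s3://<your-bucket>/pytorch/"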