Confusing Backend worker failure logs on Google Cloud Run
See original GitHub issue
Context
- torchserve version: 0.4.2
- torch-model-archiver version: Whatever version comes with the Docker image (not sure how to check)
- torch version: 1.9.0
- java version: openjdk 11.0.11 2021-04-20
- Operating System and version: docker image pytorch/torchserve:0.4.2-cpu
Your Environment
- Installed using source? [yes/no]: No
- Are you planning to deploy it using docker container? [yes/no]: Yes
- Is it a CPU or GPU environment?: CPU
- Using a default/custom handler? [If possible upload/share custom handler/model]: Custom, but getting the same results with the image_classifier handler
- What kind of model is it e.g. vision, text, audio?: vision
- Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]: local models
- Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: n/a
Expected Behavior
Clearer logs about Backend worker failures.
Current Behavior
After deploying the TorchServe Docker container to Google Cloud Run, I see the regular TorchServe start-up logs within the first 9 seconds:
2021-10-14 04:28:01,593 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
...
2021-10-14 04:28:02,031 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin
...
2021-10-14 04:28:02,106 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: Togo.mar
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model Togo
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model Togo
2021-10-14 04:28:02,191 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model Togo loaded.
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: Togo, count: 2
...
2021-10-14 04:28:03,773 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-10-14 04:28:03,776 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Lastly, I see this (only on Google Cloud Run):
Container Sandbox: Unsupported syscall setsockopt(0xa0,0x1,0xd,0x3e9999efcf70,0x8,0x4).
Then, without calling any API, after 3 minutes I see:
2021-10-14 04:31:14,244 [WARN ] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-Togo_1.0-stdout
2021-10-14 04:31:08,840 [WARN ] W-9007-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9007-Ethiopia_Tigray_2020_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9002-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9002-Rwanda_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9003-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9003-Rwanda_1.0-stdout
2021-10-14 04:30:56,141 [WARN ] W-9006-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9006-Ethiopia_Tigray_2020_1.0-stderr
2021-10-14 04:30:56,141 [WARN ] W-9001-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9001-Togo_1.0-stderr
2021-10-14 04:30:56,340 [WARN ] W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,540 [WARN ] W-9008-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9008-Kenya_1.0-stderr
2021-10-14 04:30:49,440 [WARN ] W-9009-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9009-Kenya_1.0-stderr
2021-10-14 04:30:49,640 [WARN ] W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,640 [INFO ] W-9003-Rwanda_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9003-Rwanda_1.0-stderr
2021-10-14 04:30:56,740 [INFO ] W-9009-Kenya_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9009-Kenya_1.0-stderr
2021-10-14 04:31:14,841 [ERROR] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker startup time out.
at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:85)
at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:281)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:179)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
The models and the prediction endpoint both work as expected, but since I am running TorchServe for thousands of examples, my logs are full of these “errors”. I cannot reproduce this locally, only on Google Cloud Run. The models are loaded with BaseHandler’s torch.jit.load.
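For reference, a minimal sketch of a custom handler that loads a TorchScript model the way BaseHandler does; the class name and the assumption that the serialized file is a TorchScript archive packaged in the .mar are illustrative, not taken from the issue:

```python
# Hedged sketch of a handler whose initialize() mirrors what BaseHandler does
# for a TorchScript model (torch.jit.load). Class name is hypothetical.
import os
import torch
from ts.torch_handler.base_handler import BaseHandler


class CropClassifierHandler(BaseHandler):
    def initialize(self, context):
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cpu")  # CPU-only Cloud Run deployment

        # Locate the serialized model packaged into the .mar archive.
        serialized_file = context.manifest["model"]["serializedFile"]
        model_path = os.path.join(model_dir, serialized_file)

        # Load the TorchScript model, as BaseHandler does for .pt archives.
        self.model = torch.jit.load(model_path, map_location=self.device)
        self.model.eval()
        self.initialized = True
```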
Possible Solution
This might have something to do with how Google Cloud Run shuts down inactive container instances?
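One way to check whether the workers are actually dead (rather than just noisily re-logged) is to query TorchServe's management API for one of the models named in the logs. A minimal sketch, assuming the default management address from config.properties is reachable from inside the container:

```python
# Hedged sketch: ask the TorchServe management API (default port 8081) for the
# worker status of the "Togo" model seen in the logs above. Host and port are
# the TorchServe defaults; adjust if config.properties overrides them.
import json
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"

with urllib.request.urlopen(f"{MANAGEMENT_URL}/models/Togo") as resp:
    status = json.load(resp)

# The response lists each worker and its status (e.g. READY).
print(json.dumps(status, indent=2))
```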
Failure Logs [if any]
Provided above.
Issue Analytics
- State: closed
- Created 2 years ago
- Comments:6
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@ivanzvonkov I am closing this ticket since the fix needs to be done in Google Cloud. Please feel free to reopen this ticket if needed.
@ivanzvonkov TorchServe calls setsockopt to create the UDS connection between the frontend and the backend. However, there is a [bug](https://github.com/google/gvisor/issues/1739) in Google Cloud's sandbox (gVisor). This bug causes the TorchServe backend worker initialization to time out.
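For illustration, the blocked call from the sandbox log can be decoded and reproduced from user space. The mapping of the hex arguments to SO_LINGER is an assumption based on standard Linux constants; TorchServe itself issues this call from its Java/Netty frontend, not from Python:

```python
# Illustrative decoding of the sandbox log line
#   setsockopt(0xa0, 0x1, 0xd, 0x3e9999efcf70, 0x8, 0x4)
# Assumption: the log uses raw Linux constants, so level 0x1 == SOL_SOCKET and
# optname 0xd == SO_LINGER, with an 8-byte struct linger as the option value.
# Setting the same option on a Unix domain socket (the frontend talks to the
# backend workers over UDS, per the comment above):
import socket
import struct

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
linger = struct.pack("ii", 1, 0)  # l_onoff=1, l_linger=0 -> 8 bytes
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
sock.close()
```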