Confusing Backend worker failure logs on Google Cloud Run
See original GitHub issue
Context
- torchserve version: 0.4.2
- torch-model-archiver version: Whatever version comes with the Docker image (not sure how to check)
- torch version: 1.9.0
- java version: openjdk 11.0.11 2021-04-20
- Operating System and version: docker image pytorch/torchserve:0.4.2-cpu
Your Environment
- Installed using source? [yes/no]: No
- Are you planning to deploy it using docker container? [yes/no]: Yes
- Is it a CPU or GPU environment?: CPU
- Using a default/custom handler? [If possible upload/share custom handler/model]: Custom, but getting the same results with the image_classifier handler
- What kind of model is it e.g. vision, text, audio?: vision
- Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]: local models
- Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: n/a
Expected Behavior
Clearer logs about Backend worker failures.
Current Behavior
After deploying the TorchServe Docker container to Google Cloud Run, I see the regular TorchServe start-up logs within the first 9 seconds:
2021-10-14 04:28:01,593 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
...
2021-10-14 04:28:02,031 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin
...
2021-10-14 04:28:02,106 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: Togo.mar
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model Togo
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model Togo
2021-10-14 04:28:02,191 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model Togo loaded.
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: Togo, count: 2
...
2021-10-14 04:28:03,773 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-10-14 04:28:03,776 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082
Lastly, I see this (only on Google Cloud Run):
Container Sandbox: Unsupported syscall setsockopt(0xa0,0x1,0xd,0x3e9999efcf70,0x8,0x4).
Then, without calling any API, after 3 minutes I see:
2021-10-14 04:31:14,244 [WARN ] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-Togo_1.0-stdout
2021-10-14 04:31:08,840 [WARN ] W-9007-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9007-Ethiopia_Tigray_2020_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9002-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9002-Rwanda_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9003-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9003-Rwanda_1.0-stdout
2021-10-14 04:30:56,141 [WARN ] W-9006-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9006-Ethiopia_Tigray_2020_1.0-stderr
2021-10-14 04:30:56,141 [WARN ] W-9001-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9001-Togo_1.0-stderr
2021-10-14 04:30:56,340 [WARN ] W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,540 [WARN ] W-9008-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9008-Kenya_1.0-stderr
2021-10-14 04:30:49,440 [WARN ] W-9009-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9009-Kenya_1.0-stderr
2021-10-14 04:30:49,640 [WARN ] W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,640 [INFO ] W-9003-Rwanda_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9003-Rwanda_1.0-stderr
2021-10-14 04:30:56,740 [INFO ] W-9009-Kenya_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9009-Kenya_1.0-stderr
2021-10-14 04:31:14,841 [ERROR] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker startup time out.
at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:85)
at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:281)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:179)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
The models and the prediction endpoint both work as expected, but since I am running TorchServe for thousands of examples, my logs are full of these “errors”. I cannot reproduce this locally, only on Google Cloud Run. The models are loaded with BaseHandler’s torch.jit.load.
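For reference, a minimal sketch of a custom handler that loads a TorchScript model the way BaseHandler does; the class name and the assumption that the serialized file is a TorchScript archive packaged in the .mar are illustrative, not taken from the issue:

```python
# Hedged sketch of a handler whose initialize() mirrors what BaseHandler does
# for a TorchScript model (torch.jit.load). Class name is hypothetical.
import os
import torch
from ts.torch_handler.base_handler import BaseHandler


class CropClassifierHandler(BaseHandler):
    def initialize(self, context):
        properties = context.system_properties
        model_dir = properties.get("model_dir")
        self.device = torch.device("cpu")  # CPU-only Cloud Run deployment

        # Locate the serialized model packaged into the .mar archive.
        serialized_file = context.manifest["model"]["serializedFile"]
        model_path = os.path.join(model_dir, serialized_file)

        # Load the TorchScript model, as BaseHandler does for .pt archives.
        self.model = torch.jit.load(model_path, map_location=self.device)
        self.model.eval()
        self.initialized = True
```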
Possible Solution
This might have something to do with how Google Cloud Run shuts down inactive container instances?
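One way to check whether the workers are actually dead (rather than just noisily re-logged) is to query TorchServe's management API for one of the models named in the logs. A minimal sketch, assuming the default management address from config.properties is reachable from inside the container:

```python
# Hedged sketch: ask the TorchServe management API (default port 8081) for the
# worker status of the "Togo" model seen in the logs above. Host and port are
# the TorchServe defaults; adjust if config.properties overrides them.
import json
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"

with urllib.request.urlopen(f"{MANAGEMENT_URL}/models/Togo") as resp:
    status = json.load(resp)

# The response lists each worker and its status (e.g. READY).
print(json.dumps(status, indent=2))
```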
Failure Logs [if any]
Provided above.
Issue Analytics
- State: closed
- Created 2 years ago
- Comments:6
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@ivanzvonkov I am closing this ticket since the fix needs to be done in Google Cloud. Please feel free to reopen this ticket if needed.
@ivanzvonkov TorchServe calls setsockopt to create the UDS connection between the frontend and the backend. However, there is a [bug](https://github.com/google/gvisor/issues/1739) in Google Cloud's sandbox (gVisor). This bug causes the TorchServe backend worker initialization to time out.
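For illustration, the blocked call from the sandbox log can be decoded and reproduced from user space. The mapping of the hex arguments to SO_LINGER is an assumption based on standard Linux constants; TorchServe itself issues this call from its Java/Netty frontend, not from Python:

```python
# Illustrative decoding of the sandbox log line
#   setsockopt(0xa0, 0x1, 0xd, 0x3e9999efcf70, 0x8, 0x4)
# Assumption: the log uses raw Linux constants, so level 0x1 == SOL_SOCKET and
# optname 0xd == SO_LINGER, with an 8-byte struct linger as the option value.
# Setting the same option on a Unix domain socket (the frontend talks to the
# backend workers over UDS, per the comment above):
import socket
import struct

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
linger = struct.pack("ii", 1, 0)  # l_onoff=1, l_linger=0 -> 8 bytes
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
sock.close()
```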