
Confusing Backend worker failure logs on Google Cloud Run

See original GitHub issue

Context

  • torchserve version: 0.4.2
  • torch-model-archiver version: Whatever version comes with the Docker image (not sure how to check; see the snippet after this list)
  • torch version: 1.9.0
  • java version: openjdk 11.0.11 2021-04-20
  • Operating System and version: docker image pytorch/torchserve:0.4.2-cpu
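
As an aside, one quick way to check which torch-model-archiver version ships with the image is to read the installed package metadata from inside the container. This is only a sketch; it assumes the package is pip-installed under the name torch-model-archiver and that the image's Python is 3.8+ (for importlib.metadata):

# Sketch: print the torch-model-archiver version from inside the container.
from importlib.metadata import PackageNotFoundError, version

try:
    print(version("torch-model-archiver"))
except PackageNotFoundError:
    print("torch-model-archiver is not installed as a pip package")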

Your Environment

  • Installed using source? [yes/no]: No
  • Are you planning to deploy it using docker container? [yes/no]: Yes
  • Is it a CPU or GPU environment?: CPU
  • Using a default/custom handler? [If possible upload/share custom handler/model]: Custom, but getting the same results with the image_classifier handler (a rough sketch appears after this list)
  • What kind of model is it e.g. vision, text, audio?: vision
  • Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]: local models
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: n/a
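
For reference, a rough sketch of what the custom handler might look like, based on the description above (a vision model served through BaseHandler, which loads TorchScript models via torch.jit.load). The class name, preprocessing, and postprocessing here are assumptions for illustration, not the actual handler from this issue:

# Hypothetical handler sketch (not the issue author's actual code).
# BaseHandler.initialize() already loads TorchScript models with torch.jit.load,
# so this subclass only customizes pre- and post-processing.
import io

import torch
from PIL import Image
from torchvision import transforms

from ts.torch_handler.base_handler import BaseHandler


class CustomVisionHandler(BaseHandler):  # class name is an assumption
    image_processing = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    def preprocess(self, data):
        images = []
        for row in data:
            image_bytes = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
            images.append(self.image_processing(image))
        return torch.stack(images).to(self.device)

    def postprocess(self, inference_output):
        # One score per example; the sigmoid is an assumption about the model head.
        return torch.sigmoid(inference_output).flatten().tolist()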

Expected Behavior

Clearer logs about Backend worker failures.

Current Behavior

After deploying the TorchServe Docker container to Google Cloud Run, I see the regular startup logs from TorchServe in the first 9 seconds:

2021-10-14 04:28:01,593 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
...
2021-10-14 04:28:02,031 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin
...
2021-10-14 04:28:02,106 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: Togo.mar
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model Togo
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model Togo
2021-10-14 04:28:02,191 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model Togo loaded.
2021-10-14 04:28:02,191 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: Togo, count: 2
...
2021-10-14 04:28:03,773 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2021-10-14 04:28:03,776 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://0.0.0.0:8082

Lastly (and this I see only on Google Cloud Run):

Container Sandbox: Unsupported syscall setsockopt(0xa0,0x1,0xd,0x3e9999efcf70,0x8,0x4).

Then, without calling any API, after 3 minutes I see:

2021-10-14 04:31:14,244 [WARN ] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-Togo_1.0-stdout
2021-10-14 04:31:08,840 [WARN ] W-9007-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9007-Ethiopia_Tigray_2020_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9002-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9002-Rwanda_1.0-stdout
2021-10-14 04:31:13,040 [WARN ] W-9003-Rwanda_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9003-Rwanda_1.0-stdout
2021-10-14 04:30:56,141 [WARN ] W-9006-Ethiopia_Tigray_2020_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9006-Ethiopia_Tigray_2020_1.0-stderr
2021-10-14 04:30:56,141 [WARN ] W-9001-Togo_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9001-Togo_1.0-stderr
2021-10-14 04:30:56,340 [WARN ] W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9005-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,540 [WARN ] W-9008-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9008-Kenya_1.0-stderr
2021-10-14 04:30:49,440 [WARN ] W-9009-Kenya_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9009-Kenya_1.0-stderr
2021-10-14 04:30:49,640 [WARN ] W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9004-Ethiopia_Tigray_2021_w_forecaster_1.0-stderr
2021-10-14 04:30:56,640 [INFO ] W-9003-Rwanda_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9003-Rwanda_1.0-stderr
2021-10-14 04:30:56,740 [INFO ] W-9009-Kenya_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9009-Kenya_1.0-stderr
2021-10-14 04:31:14,841 [ERROR] W-9000-Togo_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error
org.pytorch.serve.wlm.WorkerInitializationException: Backend worker startup time out.
	at org.pytorch.serve.wlm.WorkerLifeCycle.startWorker(WorkerLifeCycle.java:85)
	at org.pytorch.serve.wlm.WorkerThread.connect(WorkerThread.java:281)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:179)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

The models and prediction endpoint both work as expected, but since I am running TorchServe for thousands of examples, my logs are full of these “errors”. I cannot reproduce this locally, only on Google Cloud Run. Models are loaded with BaseHandler’s torch.jit.load.
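
Since the prediction endpoint keeps working despite these errors, one way to separate real worker failures from restart noise is to poll TorchServe's health and management APIs and look at per-model worker status. A minimal sketch, assuming the default inference (8080) and management (8081) ports; on Cloud Run only the port mapped to $PORT is reachable externally, so something like this would have to run inside the container or against a local instance:

# Sketch: report TorchServe health and per-model worker status via the
# default inference (8080) and management (8081) APIs.
import json
import urllib.request

def get_json(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())

print(get_json("http://localhost:8080/ping"))  # expect {"status": "Healthy"}
for entry in get_json("http://localhost:8081/models")["models"]:
    name = entry["modelName"]
    detail = get_json("http://localhost:8081/models/" + name)
    print(name, [w["status"] for w in detail[0]["workers"]])  # e.g. ['READY', 'READY']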

Possible Solution

Might this have something to do with how Google Cloud Run shuts down inactive container instances?

Failure Logs [if any]

Provided above.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
lxning commented, Nov 4, 2021

@ivanzvonkov I am closing this ticket since the fix needs to be done in Google Cloud. Please feel free to reopen this ticket if needed.

0 reactions
lxning commented, Nov 4, 2021

@ivanzvonkov TorchServe calls setsockopt to create a UDS (Unix domain socket) connection between the frontend and backend. However, there is a bug (https://github.com/google/gvisor/issues/1739) in Google Cloud. This bug causes the TorchServe backend worker initialization to time out.
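
(For readers decoding the sandbox log line above, here is a rough illustration of the kind of call involved. This is not TorchServe's actual code, and reading the logged arguments as SOL_SOCKET (0x1) with option 0xd, i.e. SO_LINGER on Linux, is my own assumption.)

# Illustration only: a setsockopt call on a Unix domain socket, the kind of
# syscall the gVisor sandbox reports as unsupported in the log above.
import socket
import struct

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
# SO_LINGER takes a struct linger { int l_onoff; int l_linger; }
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
sock.close()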
