
Error 507 when predicting

See original GitHub issue

Hello,

I have a rather large model that I need to use for prediction. When I make the request, I receive a 507 error and a message stating that the worker has died on the server side:

2020-06-10 11:30:41,286 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
	at java.base/java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:513)
	at java.base/java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:675)
	at org.pytorch.serve.wlm.Model.pollBatch(Model.java:155)
	at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:33)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:123)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

I suspect this is somehow related to JVM memory. I am therefore using a config.properties file with the following entry (of course, the model needs much less than 128 GB):

vmargs=-Xmx128g
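
For reference, a fuller config.properties sketch is shown below; the addresses, worker count, and paths are illustrative values, not recommendations. Note that TorchServe's backend workers are separate Python processes, so -Xmx bounds only the Java frontend's heap, not the memory the model itself occupies.

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
vmargs=-Xmx4g
default_workers_per_model=1
model_store=/models
load_models=model.mar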

I am not using the GPU for this prediction, as it is just a test. I am also running within a Docker container.

How can I debug this? Is there a way to get better error messages and a stack trace (for example, to find out whether PyTorch has issues allocating the model or similar problems)?
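
One way to get a real stack trace out of a dying worker is to log exceptions from the handler itself, since a failure during model loading otherwise surfaces only as the generic worker-died message above. Below is a minimal sketch of a custom handler, assuming a standard TorchServe install; the class name is hypothetical.

import logging
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)

class DebugHandler(BaseHandler):
    # Wrap initialization so loading/allocation failures are written
    # to the model log instead of only killing the worker process.
    def initialize(self, context):
        try:
            super().initialize(context)
        except Exception:
            logger.exception("Model initialization failed")
            raise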

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 8 (2 by maintainers)

Top GitHub Comments

1 reaction
faustomilletari commented, Jun 10, 2020

Thank you for your help. I will try on a bigger machine in the cloud with a GPU. I suspect that since the Docker container cannot swap, and since there are multiple workers that might each hold their own copy of the model, memory consumption is so high that the system legitimately runs out of memory.

I will update you within 1 hour.
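
If each worker does hold its own copy of the model, total memory scales linearly with the worker count. One quick check is to scale the model down to a single worker through TorchServe's management API (default port 8081); the model name below is a placeholder.

import requests

# Scale to a single worker; synchronous=true makes the call block
# until the scaling operation has completed.
resp = requests.put(
    "http://localhost:8081/models/my_model",
    params={"min_worker": 1, "max_worker": 1, "synchronous": "true"},
)
print(resp.status_code, resp.text)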

1 reaction
harshbafna commented, Jun 10, 2020

@faustomilletari: How many workers are you using for your model? Can you share the output of the top command for different scenarios: after starting the workers, while running inference with a smaller input size, and with a larger input size?

Please also share the different log files from the log folder, which is generated by default in the directory from which you start TorchServe.
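
As a convenience, a small sketch for listing whatever log files TorchServe produced, assuming the default logs directory relative to where the server was started:

from pathlib import Path

# Print the TorchServe log files, most recently modified first.
for log in sorted(Path("logs").glob("*.log"),
                  key=lambda p: p.stat().st_mtime, reverse=True):
    print(f"{log.name}: {log.stat().st_size} bytes")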

Read more comments on GitHub >

Top Results From Across the Web

507 Insufficient Storage - HTTP Status Code Glossary - WebFX
This condition is considered to be temporary. If the request that received this status code was the result of a user action, the...
Read more >
507 Insufficient Storage - HTTP - MDN Web Docs
It indicates that a method could not be performed because the server cannot store the representation needed to successfully complete the request ......
Read more >
How to fix error 507 for Creative Cloud apps - Adobe Support
You get error code 507 when the installation of your Creative Cloud app fails due to the volume for your installed files being...
Read more >
Error code reference - IBM
Error codes are listed in numerical order. ... No enclosure identity and no node state on partner; 507 ... Flash module is predicted...
Read more >
Display generic error message and exit - Stata
The by-variable takes on too many different values to construct a readable chart. 135. not possible with weighted data. You attempted to predict...
Read more >
