Error 507 when predicting
Hello,
I have a rather large model that I need to use for prediction. When I make the request, I receive a 507 error and a message stating that the worker has died. On the server side I see:
2020-06-10 11:30:41,286 [DEBUG] W-9000-model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
at java.base/java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:513)
at java.base/java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:675)
at org.pytorch.serve.wlm.Model.pollBatch(Model.java:155)
at org.pytorch.serve.wlm.BatchAggregator.getRequest(BatchAggregator.java:33)
at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:123)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
I suspect this is somehow related to JVM memory. I am therefore using a config.properties file with the following entry:
vmargs=-Xmx128g
(of course the model needs much less than 128 GB)
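For reference, a minimal config.properties along these lines might look as follows; vmargs is the only entry actually taken from my setup, the other keys exist in TorchServe but the values here are purely illustrative:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
vmargs=-Xmx128g
# fewer workers means fewer copies of the model held in memory
default_workers_per_model=1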
I am not using the GPU for this prediction, as it is just a test. I am also running inside a Docker container.
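As a sketch of what I mean (the image tag, mount path, and limits below are illustrative, not my exact command), the container is started roughly like this; an explicit --memory limit at least makes it obvious when the container itself hits its memory ceiling:
docker run --rm -p 8080:8080 -p 8081:8081 \
    --memory=64g --shm-size=2g \
    -v $(pwd)/model_store:/home/model-server/model-store \
    pytorch/torchserve:latest
# the model is then registered through the management API on port 8081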
How can I debug this? Is there a way to get better error messages and a stack trace (for example, to find out whether PyTorch has trouble allocating the model, or similar problems)?
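So far the only diagnostics I know of (assuming the default ports and log configuration; my_model is a placeholder for the real model name) are the management API's describe endpoint, which reports each worker's status and memory usage, and the default log files:
curl http://localhost:8081/models/my_model
# Python-side stack traces from the workers end up in the default log files
tail -n 100 logs/model_log.log logs/ts_log.log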
Issue Analytics
- State:
- Created 3 years ago
- Comments: 8 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thank you for your help. I will try on a bigger machine in the cloud with a GPU. I suspect that, since the Docker container cannot swap and there are multiple workers that may each hold their own copy of the model, the memory consumption is high enough that the system legitimately runs out of memory.
I will update you within 1 hour.
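One way to test that hypothesis (assuming the management API is reachable on its default port 8081; my_model stands in for the actual model name) is to scale the model down to a single worker and retry the request:
# scale to a single worker; synchronous=true makes the call block until scaling is done
curl -X PUT "http://localhost:8081/models/my_model?min_worker=1&max_worker=1&synchronous=true"
# confirm worker count, status and per-worker memory usage
curl http://localhost:8081/models/my_model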
@faustomilletari: How many workers are you using for your model, and can you share the output of the top command for different scenarios, e.g. after starting the workers, and while running inference with a smaller and a larger input size?
Please also share the log files from the logs folder, which is by default generated in the directory from which you start TorchServe.
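A sketch of how that information could be collected (the top flags assume a procps-style top inside the container; docker stats is only relevant because the issue mentions running in Docker):
# per-process memory, sorted by %MEM: shows the frontend JVM and each Python worker
top -b -n 1 -o %MEM | head -n 20
# container-level memory usage and limit, run from the host
docker stats --no-stream
# the default log files, written under logs/ next to where torchserve was started
ls -lh logs/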