0.4.0 and 0.4.2 GPU/CPU hits system thread count limit – threads not cleaned up
Context
We are currently testing torchserve to run our models. Our issue is that we are seeing torchserve create a few new threads per inference request and never clean them up. The requests themselves seem to go through properly.
- torchserve version: happened on torchserve-0.4.0-gpu and torchserve-0.4.2-gpu. Same issue with the -cpu flavor.
EDIT: We currently suspect workflows to be the culprit. The thread spam does not happen when calling the models directly.
We’re testing some minimal deployments right now, but it’s entirely possible the fault is on our side.
We’re looking for ideas — what could cause such behavior?
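To make the workflow-vs-direct distinction above concrete, this is roughly how we exercise the two paths. It is a minimal Java sketch rather than our real client: the model name, workflow name, sample file, and default inference port 8080 are placeholders; the /predictions/{model} and /wfpredict/{workflow} paths are the TorchServe inference endpoints as we use them.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class WorkflowVsModelRepro {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Calling the model directly: the thread count in the torchserve JVM stays flat.
        HttpRequest direct = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/predictions/my_model"))
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("sample.jpg")))
                .build();

        // Calling the same model through a workflow: a few extra threads appear
        // per request and are never cleaned up.
        HttpRequest viaWorkflow = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/wfpredict/my_workflow"))
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("sample.jpg")))
                .build();

        for (int i = 0; i < 100; i++) {
            client.send(viaWorkflow, HttpResponse.BodyHandlers.ofString());
            // Swap in `direct` above to confirm the leak only shows up with workflows.
        }
    }
}
```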
Your Environment
- Installed using source? [yes/no]: no
- Are you planning to deploy it using docker container? [yes/no]: yes
- Is it a CPU or GPU environment?: tested both, same issue
- Using a default/custom handler? [If possible upload/share custom handler/model]: yes, cannot share at this time
- What kind of model is it e.g. vision, text, audio?: vision
- Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]: model-store
- Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: defaults from docker image
Expected Behavior
Threads should be cleaned up after some time; we shouldn’t hit the system thread limit.
Current Behavior
The thread count grows inside the Java process until it errors out saying it could not create new threads.
Failure Logs [if any]
One interesting detail: it seems that something (netty, maybe?) is creating MULTIPLE thread pools, since the logs refer to threads such as pool-402-thread-2, where the pool number keeps going up while the thread number stays at 1-2.
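For reference, pool-N-thread-M is the naming scheme of the JDK's default thread factory: N is a JVM-wide counter that increments every time a new pool is created, and M counts the threads inside that pool, so a growing pool number with the thread number stuck at 1-2 points at a small, fresh ExecutorService per request that is never shut down. A minimal, self-contained sketch of that pattern (not TorchServe code, just an illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolLeakDemo {
    public static void main(String[] args) throws InterruptedException {
        for (int request = 0; request < 5; request++) {
            // A fresh pool per "request": the JDK default thread factory names its
            // threads pool-N-thread-M, and N increments for every pool ever created.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            pool.submit(() -> System.out.println(Thread.currentThread().getName()));
            pool.submit(() -> System.out.println(Thread.currentThread().getName()));
            // No pool.shutdown() here, so both (non-daemon) workers stay parked in
            // LinkedBlockingQueue.take(), the WAITING (parking) stack shown further down.
            Thread.sleep(100);
        }
        // Prints pool-1-thread-1/2, pool-2-thread-1/2, ... and the process never
        // exits, because the idle workers are still alive.
    }
}
```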
jstack confirms a massive number of threads in the following state:
```
"pool-402-thread-2" #1649 prio=5 os_prio=0 cpu=0.20ms elapsed=0.28s tid=0x00007efb9c046800 nid=0x77e waiting on condition  [0x00007efadc587000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
	- parking to wait for  <0x000000070b0a0600> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.11/CompletableFuture.java:1798)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.11/ForkJoinPool.java:3128)
	at java.util.concurrent.CompletableFuture.timedGet(java.base@11.0.11/CompletableFuture.java:1868)
	at java.util.concurrent.CompletableFuture.get(java.base@11.0.11/CompletableFuture.java:2021)
	at org.pytorch.serve.ensemble.DagExecutor.invokeModel(DagExecutor.java:157)
	at org.pytorch.serve.ensemble.DagExecutor.lambda$execute$0(DagExecutor.java:74)
	at org.pytorch.serve.ensemble.DagExecutor$$Lambda$82/0x00000008401f6440.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
	at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.11/Executors.java:515)
	at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
```
Other threads are just WAITING:
```
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
	- parking to wait for  <0x000000070b0206f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.11/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.11/LinkedBlockingQueue.java:433)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1054)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
```
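Reading the two traces together: the TIMED_WAITING threads are parked in CompletableFuture.get(timeout) inside DagExecutor.invokeModel, while the WAITING ones look like idle pool workers blocked on their task queue. A small standalone sketch (again not TorchServe code) that reproduces the same pair of stack shapes, which can be compared against a jstack dump:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadStateDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Worker 1 blocks in CompletableFuture.get(timeout) on a future that is
        // never completed: jstack shows it as TIMED_WAITING (parking), the same
        // shape as the DagExecutor.invokeModel frames above.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        Callable<String> waitOnFuture = () -> neverCompleted.get(30, TimeUnit.SECONDS);
        pool.submit(waitOnFuture);

        // Worker 2 runs one trivial task and then idles in LinkedBlockingQueue.take():
        // jstack shows it as WAITING (parking), the same shape as the second trace.
        pool.submit(() -> System.out.println("second worker started"));

        Thread.sleep(2_000);
        // Run `jstack <pid>` against this process now and compare the two stacks.
        // The pool is intentionally never shut down, so the process keeps running.
    }
}
```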
I'm here to provide more info. Any input at all is appreciated!
Ran into a similar situation, too. The workflow works well for fewer than 16k requests, then starts returning 500 errors (the API said it hit the CPU/process limit).
This issue is a duplicate of https://github.com/pytorch/serve/issues/1581, so closing this for now and tracking the fix in the other issue.