0.4.0 and 0.4.2 GPU/CPU hits system thread count limit – threads not cleaned up
Context
We are currently testing torchserve to run our models. Our issue is that we are seeing torchserve create a few new threads per inference request and never clean them up. The requests themselves seem to go through properly.
- torchserve version: happened on torchserve-0.4.0-gpu and torchserve-0.4.2-gpu. Same issue with the -cpu flavor.
EDIT: We currently suspect workflows to be the culprit. The thread spam does not happen when calling the models directly.
We’re testing some minimal deployments right now, but it’s entirely possible the fault is on our side.
We’re looking for ideas — what could cause such behavior?
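To make the workflow-vs-direct distinction above concrete, this is roughly how we exercise the two paths. It is a minimal Java sketch rather than our real client: the model name, workflow name, sample file, and default inference port 8080 are placeholders; the /predictions/{model} and /wfpredict/{workflow} paths are the TorchServe inference endpoints as we use them.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class WorkflowVsModelRepro {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Calling the model directly: the thread count in the torchserve JVM stays flat.
        HttpRequest direct = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/predictions/my_model"))
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("sample.jpg")))
                .build();

        // Calling the same model through a workflow: a few extra threads appear
        // per request and are never cleaned up.
        HttpRequest viaWorkflow = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/wfpredict/my_workflow"))
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("sample.jpg")))
                .build();

        for (int i = 0; i < 100; i++) {
            client.send(viaWorkflow, HttpResponse.BodyHandlers.ofString());
            // Swap in `direct` above to confirm the leak only shows up with workflows.
        }
    }
}
```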
Your Environment
- Installed using source? [yes/no]: no
- Are you planning to deploy it using docker container? [yes/no]: yes
- Is it a CPU or GPU environment?: tested both, same issue
- Using a default/custom handler? [If possible upload/share custom handler/model]: yes, cannot share at this time
- What kind of model is it e.g. vision, text, audio?: vision
- Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]: model-store
- Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: defaults from docker image
Expected Behavior
Threads should be cleaned up after some time; we shouldn’t hit the system thread limit.
Current Behavior
The thread count grows inside the Java process until it errors out saying it could not create new threads.
Failure Logs [if any]
One interesting detail: it seems that something (netty, maybe?) is creating MULTIPLE thread pools, since the logs refer to threads such as pool-402-thread-2, where the pool number keeps going up while the thread number stays at 1-2.
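For reference, pool-N-thread-M is the naming scheme of the JDK's default thread factory: N is a JVM-wide counter that increments every time a new pool is created, and M counts the threads inside that pool, so a growing pool number with the thread number stuck at 1-2 points at a small, fresh ExecutorService per request that is never shut down. A minimal, self-contained sketch of that pattern (not TorchServe code, just an illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolLeakDemo {
    public static void main(String[] args) throws InterruptedException {
        for (int request = 0; request < 5; request++) {
            // A fresh pool per "request": the JDK default thread factory names its
            // threads pool-N-thread-M, and N increments for every pool ever created.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            pool.submit(() -> System.out.println(Thread.currentThread().getName()));
            pool.submit(() -> System.out.println(Thread.currentThread().getName()));
            // No pool.shutdown() here, so both (non-daemon) workers stay parked in
            // LinkedBlockingQueue.take(), the WAITING (parking) stack shown further down.
            Thread.sleep(100);
        }
        // Prints pool-1-thread-1/2, pool-2-thread-1/2, ... and the process never
        // exits, because the idle workers are still alive.
    }
}
```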
jstack confirms a massive number of threads in the following state:
```
"pool-402-thread-2" #1649 prio=5 os_prio=0 cpu=0.20ms elapsed=0.28s tid=0x00007efb9c046800 nid=0x77e waiting on condition  [0x00007efadc587000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
	- parking to wait for  <0x000000070b0a0600> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.11/CompletableFuture.java:1798)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.11/ForkJoinPool.java:3128)
	at java.util.concurrent.CompletableFuture.timedGet(java.base@11.0.11/CompletableFuture.java:1868)
	at java.util.concurrent.CompletableFuture.get(java.base@11.0.11/CompletableFuture.java:2021)
	at org.pytorch.serve.ensemble.DagExecutor.invokeModel(DagExecutor.java:157)
	at org.pytorch.serve.ensemble.DagExecutor.lambda$execute$0(DagExecutor.java:74)
	at org.pytorch.serve.ensemble.DagExecutor$$Lambda$82/0x00000008401f6440.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
	at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.11/Executors.java:515)
	at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
```
Other threads are just WAITING:
```
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
	- parking to wait for  <0x000000070b0206f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.11/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.11/LinkedBlockingQueue.java:433)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1054)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
```
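Reading the two traces together: the TIMED_WAITING threads are parked in CompletableFuture.get(timeout) inside DagExecutor.invokeModel, while the WAITING ones look like idle pool workers blocked on their task queue. A small standalone sketch (again not TorchServe code) that reproduces the same pair of stack shapes, which can be compared against a jstack dump:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadStateDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Worker 1 blocks in CompletableFuture.get(timeout) on a future that is
        // never completed: jstack shows it as TIMED_WAITING (parking), the same
        // shape as the DagExecutor.invokeModel frames above.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        Callable<String> waitOnFuture = () -> neverCompleted.get(30, TimeUnit.SECONDS);
        pool.submit(waitOnFuture);

        // Worker 2 runs one trivial task and then idles in LinkedBlockingQueue.take():
        // jstack shows it as WAITING (parking), the same shape as the second trace.
        pool.submit(() -> System.out.println("second worker started"));

        Thread.sleep(2_000);
        // Run `jstack <pid>` against this process now and compare the two stacks.
        // The pool is intentionally never shut down, so the process keeps running.
    }
}
```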
I'm here to provide more info. Any input at all is appreciated!
Ran into a similar situation, too. The workflow works well for fewer than 16k requests, then starts returning 500 errors (the API said it hit the CPU/process limit).
This issue is a duplicate of https://github.com/pytorch/serve/issues/1581, so closing this for now and tracking the fix in the other issue.