0.4.0 and 0.4.2 GPU/CPU hits system thread count limit – threads not cleaned up

Context

We are currently testing torchserve to run our models. Our issue is that we are seeing torchserve create a few new threads per inference request and never clean them up. The requests themselves seem to go through properly.

  • torchserve version: happened on torchserve-0.4.0-gpu and torchserve-0.4.2-gpu. Same issue on -cpu flavor.

EDIT: We are currently suspecting workflows to be the culprit. Thread spam does not happen when using the models directly.

We’re testing some minimal deployments right now, but it’s entirely possible the fault is on our side.

We’re looking for ideas — what could cause such behavior?

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: yes
  • Is it a CPU or GPU environment?: tested both, same issue
  • Using a default/custom handler? [If possible upload/share custom handler/model]: yes, cannot share at this time
  • What kind of model is it e.g. vision, text, audio?: vision
  • Are you planning to use local models from model-store or public url being used e.g. from S3 bucket etc.? [If public url then provide link.]: model-store
  • Provide config.properties, logs [ts.log] and parameters used for model registration/update APIs: defaults from docker image

Expected Behavior

Threads should die properly after some time. We shouldn’t hit the system limit.

Current Behavior

Thread count grows inside the java process until it errors out saying it could not create new threads.

Failure Logs [if any]

One interesting detail: something (netty, maybe?) appears to be creating multiple thread pools. The logs refer to threads such as pool-402-thread-2, where the pool number keeps going up but the thread number is only ever 1 or 2.

jstack confirms massive amounts of threads in the following state:

"pool-402-thread-2" #1649 prio=5 os_prio=0 cpu=0.20ms elapsed=0.28s tid=0x00007efb9c046800 nid=0x77e waiting on condition  [0x00007efadc587000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
	- parking to wait for  <0x000000070b0a0600> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.11/LockSupport.java:234)
	at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.11/CompletableFuture.java:1798)
	at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.11/ForkJoinPool.java:3128)
	at java.util.concurrent.CompletableFuture.timedGet(java.base@11.0.11/CompletableFuture.java:1868)
	at java.util.concurrent.CompletableFuture.get(java.base@11.0.11/CompletableFuture.java:2021)
	at org.pytorch.serve.ensemble.DagExecutor.invokeModel(DagExecutor.java:157)
	at org.pytorch.serve.ensemble.DagExecutor.lambda$execute$0(DagExecutor.java:74)
	at org.pytorch.serve.ensemble.DagExecutor$$Lambda$82/0x00000008401f6440.call(Unknown Source)
	at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
	at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.11/Executors.java:515)
	at java.util.concurrent.FutureTask.run(java.base@11.0.11/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)

Other threads are just WAITING:

   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.11/Native Method)
	- parking to wait for  <0x000000070b0206f0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.11/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11.0.11/AbstractQueuedSynchronizer.java:2081)
	at java.util.concurrent.LinkedBlockingQueue.take(java.base@11.0.11/LinkedBlockingQueue.java:433)
	at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@11.0.11/ThreadPoolExecutor.java:1054)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.11/ThreadPoolExecutor.java:1114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.11/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run(java.base@11.0.11/Thread.java:829)
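
For illustration only, here is a minimal Java sketch of one pattern that would produce thread names like these: a new executor created per request and never shut down. This is not TorchServe code, and the issue does not confirm that this is the actual cause. `PoolLeakSketch`, `newPool`, `leakyRequest`, `cleanRequest`, and `livePoolThreads` are hypothetical names. The default thread factory names threads `pool-N-thread-M`, and idle workers of a pool that is never shut down park in `LinkedBlockingQueue.take`, which matches the WAITING trace above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; none of these names come from TorchServe.
class PoolLeakSketch {

    // Task that reports the name of the worker thread running it.
    static final Callable<String> WHO_AM_I = () -> Thread.currentThread().getName();

    // Builds a pool whose threads get the default "pool-N-thread-M" names.
    static ExecutorService newPool() {
        ThreadFactory base = Executors.defaultThreadFactory();
        return Executors.newFixedThreadPool(2, r -> {
            Thread t = base.newThread(r);
            t.setDaemon(true); // demo only, so this JVM can still exit after leaking
            return t;
        });
    }

    // Leaky pattern: a new pool per request that is never shut down.
    // Its worker stays alive, waiting for tasks that never arrive.
    static String leakyRequest() throws Exception {
        ExecutorService pool = newPool();
        return pool.submit(WHO_AM_I).get();
    }

    // Same work, but the pool is shut down so its worker can exit.
    static String cleanRequest() throws Exception {
        ExecutorService pool = newPool();
        try {
            return pool.submit(WHO_AM_I).get();
        } finally {
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.SECONDS);
        }
    }

    // Counts live threads whose names follow the default pool naming pattern.
    static long livePoolThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().matches("pool-\\d+-thread-\\d+"))
                .count();
    }

    public static void main(String[] args) throws Exception {
        long before = livePoolThreads();
        for (int i = 0; i < 5; i++) {
            System.out.println("leaky request ran on " + leakyRequest());
        }
        System.out.println("threads left behind: " + (livePoolThreads() - before));
        System.out.println("clean request ran on " + cleanRequest());
    }
}
```

In the leaky variant every call gets a new pool number while the thread number stays low, which is the same shape as the `pool-402-thread-2` names in the logs. Reusing one shared executor, or shutting each pool down as in `cleanRequest`, avoids the growth in this sketch.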

I'm here to provide more info. Any input at all is appreciated!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
xingener commented, Jan 10, 2022

Ran into a similar situation, too. The workflow works well for fewer than 16k requests, then runs into a 500 error (the API said it hit the CPU/process limit).

0 reactions
msaroufim commented, May 4, 2022

This issue is a duplicate of https://github.com/pytorch/serve/issues/1581, so closing this for now and tracking the fix in the other issue.
