how multiple instances on the same device can be concurrent
After reading the code, I am confused about how multiple instances on the same device can run concurrently.
Multiple instances on the same device share a single TritonBackendThread object (in src/backends/backend/triton_model_instance.cc), and model_instances_ holds all of the instances on that device.
In TritonModelInstance::TritonBackendThread::BackendThread, the thread calls
model_->Server()->GetRateLimiter()->DequeuePayload(model_instances_, &payload);
to obtain a payload. My question is: suppose there are two instances of model x on device 0, A and B. If instance A is assigned the payload, then payload->Execute() starts the forward() pass. Does that mean instance B cannot be assigned a payload and execute until instance A has finished?
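To check my reading of the loop, here is a self-contained toy of the pattern as I understand it. Only the names Payload, DequeuePayload and Execute mirror the real code; the queue and the work items are made up for illustration:

#include <functional>
#include <queue>
#include <thread>

// Toy model of one backend thread shared by two instances on the same
// device: payloads are dequeued and executed strictly one at a time.
struct Payload {
  int instance_id;               // which instance (A=0, B=1) was assigned
  std::function<void()> work;    // stand-in for the instance's forward()
};

int main() {
  std::queue<Payload> rate_limiter;            // stand-in for the rate limiter queue
  rate_limiter.push({0, [] { /* instance A forward() */ }});
  rate_limiter.push({1, [] { /* instance B forward() */ }});

  std::thread backend_thread([&] {
    while (!rate_limiter.empty()) {
      Payload payload = rate_limiter.front();  // DequeuePayload(model_instances_, &payload)
      rate_limiter.pop();
      payload.work();                          // payload->Execute(); the loop is blocked
                                               // until this returns, so B waits for A
    }
  });
  backend_thread.join();
  return 0;
}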
Top GitHub Comments
@FreshZZ Thanks for asking this question. The backend thread sharing is only implemented when using a GPU instance with the device_blocking execution policy enabled by the backend. See these lines: https://github.com/triton-inference-server/core/blob/main/src/backend_model_instance.cc#L314-321
If device_blocking is false, then each TritonModelInstance will create its own triton_backend_thread_ and hence achieve full concurrency. See here how Triton core detects that a backend has the device_blocking execution policy set: https://github.com/triton-inference-server/core/blob/main/src/backend_model.cc#L175-L185
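In other words (a hypothetical sketch of the decision those linked lines make; the names device_blocking, kind, device_id and device_threads below are illustrative, not the actual members in Triton core):

#include <map>
#include <memory>

enum class InstanceKind { CPU, GPU };

// Placeholder for the object that owns the dequeue/execute loop thread.
struct BackendThread {};

// One BackendThread per device when the backend asked for device_blocking
// and the instance runs on a GPU; otherwise a private thread per instance.
std::shared_ptr<BackendThread> ThreadForInstance(
    bool device_blocking, InstanceKind kind, int device_id,
    std::map<int, std::shared_ptr<BackendThread>>& device_threads) {
  if (device_blocking && kind == InstanceKind::GPU) {
    auto& shared = device_threads[device_id];
    if (!shared) {
      shared = std::make_shared<BackendThread>();
    }
    return shared;  // instances A and B on device 0 reuse this thread
  }
  // device_blocking is false (or a CPU instance): each TritonModelInstance
  // gets its own triton_backend_thread_, so A and B run fully concurrently.
  return std::make_shared<BackendThread>();
}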
Read more about the device-blocking execution policy here: https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h#L781-L794
The behavior of a backend requesting the device-blocking execution policy is as you observed. At present only the TensorRT backend uses it, because its execution is asynchronous: the backend is implemented such that, even with only a single backend thread, it can run multiple inference requests concurrently.
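For completeness, a minimal sketch of how a backend opts into this policy from its TRITONBACKEND_Initialize entry point, using TRITONBACKEND_BackendSetExecutionPolicy from tritonbackend.h (error handling and the rest of the initialization are omitted):

#include "triton/core/tritonbackend.h"

extern "C" {

// Request device-blocking execution when the backend library is loaded.
// Triton core will then create one backend thread per device (shared by
// all GPU instances on that device) instead of one thread per instance,
// so the backend itself must dispatch work asynchronously.
TRITONSERVER_Error* TRITONBACKEND_Initialize(TRITONBACKEND_Backend* backend)
{
  return TRITONBACKEND_BackendSetExecutionPolicy(
      backend, TRITONBACKEND_EXECUTION_DEVICE_BLOCKING);  // nullptr on success
}

}  // extern "C"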
@tanmayv25 Thank you so much