
How can multiple instances on the same device be concurrent?

See original GitHub issue

After reading the code, I am confused about how multiple instances on the same device can run concurrently.

Multiple instances on the same device share a TritonBackendThread object, in src/backends/backend/triton_model_instance.cc: [screenshot]

model_instances_ stores all the instances on this device: [screenshot]

In the function TritonModelInstance::TritonBackendThread::BackendThread: [screenshot]

My question is: suppose there are two instances of model x, A and B, on device 0. model_->Server()->GetRateLimiter()->DequeuePayload(model_instances_, &payload); obtains a payload, and suppose instance A is assigned to it; then payload->Execute() starts the forward pass.

Does instance B have to wait to be assigned a payload and execute until instance A has completed?
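To make the concern concrete, here is a toy, self-contained C++ sketch (not Triton code; the Payload struct and the loop are made up for illustration) of the serial behaviour being asked about: one shared backend thread dequeues payloads one at a time and executes each synchronously, so a payload assigned to instance B waits until instance A's payload has finished.

```cpp
#include <chrono>
#include <iostream>
#include <queue>
#include <string>
#include <thread>

struct Payload {
  std::string instance;  // which instance the rate limiter assigned
  int request_id;
};

int main() {
  std::queue<Payload> pending;
  pending.push({"A", 1});
  pending.push({"B", 2});  // B's payload sits behind A's on the shared thread

  // The single shared "backend thread": dequeue one payload, execute it
  // synchronously, and only then dequeue the next one.
  std::thread backend_thread([&pending]() {
    while (!pending.empty()) {
      Payload p = pending.front();
      pending.pop();
      std::cout << "executing request " << p.request_id << " on instance "
                << p.instance << std::endl;
      // Stand-in for a synchronous forward(); nothing else runs on this
      // thread while it is in progress.
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      std::cout << "finished request " << p.request_id << std::endl;
    }
  });
  backend_thread.join();
  return 0;
}
```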

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

3 reactions
tanmayv25 commented, May 5, 2022

@FreshZZ Thanks for asking this question. Backend thread sharing is only implemented for GPU instances, and only when the backend enables the device_blocking execution policy. See this line: https://github.com/triton-inference-server/core/blob/main/src/backend_model_instance.cc#L314-321 If device_blocking is false, then each TritonModelInstance creates its own triton_backend_thread_ and hence achieves full concurrency.
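A minimal sketch of that sharing rule, with purely illustrative names (GetOrCreateBackendThread and the local enums are stand-ins, not the actual code in backend_model_instance.cc): only GPU instances of a backend that requested device_blocking reuse a per-device thread; everything else gets its own.

```cpp
#include <cassert>
#include <map>
#include <memory>

// These mirror the TRITONBACKEND_ExecutionPolicy values in tritonbackend.h
// (TRITONBACKEND_EXECUTION_BLOCKING / TRITONBACKEND_EXECUTION_DEVICE_BLOCKING),
// redefined locally so the sketch stands alone.
enum ExecutionPolicy { EXECUTION_BLOCKING, EXECUTION_DEVICE_BLOCKING };
enum InstanceKind { KIND_CPU, KIND_GPU };

// Stand-in for the object that owns the dequeue/execute loop.
struct BackendThread {};

// Hypothetical helper: decides whether an instance reuses a per-device
// backend thread or gets a dedicated one.
std::shared_ptr<BackendThread> GetOrCreateBackendThread(
    ExecutionPolicy policy, InstanceKind kind, int device_id,
    std::map<int, std::shared_ptr<BackendThread>>& device_threads) {
  if (policy == EXECUTION_DEVICE_BLOCKING && kind == KIND_GPU) {
    // device_blocking + GPU: all instances on this device share one thread,
    // so instances A and B on device 0 are both serviced by device_threads[0].
    auto& thread = device_threads[device_id];
    if (thread == nullptr) {
      thread = std::make_shared<BackendThread>();
    }
    return thread;
  }
  // Otherwise each TritonModelInstance gets its own backend thread, so
  // instances on the same device execute fully concurrently.
  return std::make_shared<BackendThread>();
}

int main() {
  std::map<int, std::shared_ptr<BackendThread>> device_threads;
  auto a = GetOrCreateBackendThread(EXECUTION_DEVICE_BLOCKING, KIND_GPU, 0, device_threads);
  auto b = GetOrCreateBackendThread(EXECUTION_DEVICE_BLOCKING, KIND_GPU, 0, device_threads);
  assert(a == b);  // shared thread on device 0
  auto c = GetOrCreateBackendThread(EXECUTION_BLOCKING, KIND_GPU, 0, device_threads);
  assert(c != a);  // dedicated thread when device_blocking is not used
  return 0;
}
```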

See here how Triton core detects that the backend has the device_blocking execution policy set: https://github.com/triton-inference-server/core/blob/main/src/backend_model.cc#L175-L185

Read more about the device_blocking execution policy here: https://github.com/triton-inference-server/core/blob/main/include/triton/core/tritonbackend.h#L781-L794

The behavior of a backend requesting the device_blocking execution policy is as you observed. At present only the TensorRT backend uses the device_blocking execution policy, because its execution is asynchronous: even with only a single backend thread, the backend is able to run multiple inference requests concurrently.
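A rough illustration of that last point (pure illustration, with std::async standing in for an asynchronous enqueue such as TensorRT's on a CUDA stream): the single backend thread only launches work and immediately moves on to the next payload, so several requests can be in flight at once.

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main() {
  std::vector<std::future<int>> in_flight;

  // The single "backend thread" loop: launch each request, do not wait for it.
  for (int request_id = 1; request_id <= 3; ++request_id) {
    in_flight.push_back(std::async(std::launch::async, [request_id]() {
      // Stand-in for an inference that executes asynchronously.
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      return request_id;
    }));
    std::cout << "launched request " << request_id << std::endl;
  }

  // All three requests are now in flight even though a single thread
  // launched them; completion is collected separately.
  for (auto& f : in_flight) {
    std::cout << "completed request " << f.get() << std::endl;
  }
  return 0;
}
```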

2 reactions
FreshZZ commented, May 7, 2022

@tanmayv25 Thank you so much


Top Results From Across the Web

  • How to handle multiple concurrent instances at the same time?
    The first thing your program should do is try to create a shared memory segment (using your own made up key) and store...
  • Run multiple concurrent UI flows on a single Windows Server ...
    Use two or more user accounts to create UI Flows connections targeting the gateway on this machine. You can now run multiple UI...
  • Maximum concurrent requests per instance (services)
    By default each Cloud Run container instance can receive up to 80 requests at the same time; you can increase this to a...
  • How to Install Multiple Copies and Run Multiple Instances of ...
    Here is how you can run multiple instances of an app using Parallel Space: Open Parallel Space and tap on the apps you...
  • Is it possible to have several people on my team connected to ...
    Yes. Multiple people can connect to a single instance (concurrent device). Note that every person connected to the instance will see the...
