Is InferContext in the Python/C++ API related to CUDA context?
To elaborate my question, I have a few use cases as below:

- For server config, `instance_group=1` and `max_batch_size=8` are set. In the Python client for inference, I instantiate one `InferContext` object and concurrently call `ctx.async_run(batch_size=1)` in eight sub-threads. Can the server batch the eight requests from one context?
- For server config, `instance_group=1` and `max_batch_size=8` are set. In the Python client for inference, I instantiate eight `InferContext` objects and concurrently call `ctx_i.async_run(batch_size=1)` in eight sub-threads. Can the server batch the eight requests from eight different contexts?
- For server config, `instance_group=8` and `max_batch_size=1` are set. In the Python client for inference, I instantiate one `InferContext` object and concurrently call `ctx.async_run(batch_size=1)` in eight sub-threads. Can the server dispatch the eight requests from one context to eight model instances?
- For server config, `instance_group=8` and `max_batch_size=1` are set. In the Python client for inference, I instantiate eight `InferContext` objects and concurrently call `ctx_i.async_run(batch_size=1)` in eight sub-threads. Can the server dispatch the eight requests from eight different contexts to eight model instances?
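For reference, these two knobs live in the model's `config.pbtxt`; in the actual config format, `instance_group` is a block whose `count` field corresponds to the `instance_group=N` shorthand used above. A minimal sketch with placeholder model name and platform:

```
name: "my_model"           # placeholder
platform: "tensorrt_plan"  # placeholder
max_batch_size: 8
instance_group [
  {
    count: 1        # number of model instances, i.e. the "instance_group=1" case
    kind: KIND_GPU
  }
]
```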
For a more specific use case, I have a server built with Flask that receives client requests. To use the Python API to communicate with TRTIS, should I instantiate one `InferContext` and call `ctx.async_run()` for each inference request? Or should I instantiate multiple `InferContext` objects (equal to `instance_group` in the server config) and call `ctx.async_run()` from the different context objects? A minimal sketch of the first pattern follows.
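Below is a minimal sketch of one shared `InferContext` driven from eight threads, batch size 1 each. It assumes the legacy `tensorrtserver.api` client; the exact `async_run` signature varied across TRTIS releases (older releases returned a request id that is collected with `get_async_run_results`, later ones took a callback), and the server URL, model name, tensor names, and shapes are placeholders, not from this issue.

```python
import threading
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# One shared context for all eight threads ("my_model" and the tensor
# names/shapes below are illustrative placeholders).
ctx = InferContext("localhost:8000", ProtocolType.HTTP, "my_model", -1)

def infer_one():
    data = np.zeros((3, 224, 224), dtype=np.float32)  # one batch item
    # Submit asynchronously with batch_size=1; in older client releases
    # async_run returns a request id and the result is fetched below.
    request_id = ctx.async_run({"input": [data]},
                               {"output": InferContext.ResultFormat.RAW},
                               batch_size=1)
    result = ctx.get_async_run_results(request_id, True)

threads = [threading.Thread(target=infer_one) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```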
Thank you.
Top GitHub Comments
Confirming that InferContext has nothing to do with CUDA contexts. InferContext is an abstraction in the client libraries. TRTIS uses a single CUDA context for each GPU device that it controls. You may find the architecture section of the doc interesting: https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/architecture.html
As far as I know, `InferContext` is not related to a CUDA context. For the use case you explained, it might be better to start a new `InferContext` in a separate thread so your client is not blocked from receiving further API requests. Ideally, I would add a queue where your Flask client submits requests, and have multiple workers listening on the queue to process requests using TRTIS for inference.
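A hedged sketch of that queue-plus-workers idea, again assuming the legacy `tensorrtserver.api` client and its synchronous `ctx.run(inputs, outputs, batch_size)`; the endpoint, model name, tensor names, shapes, and worker count are all illustrative, not from this thread:

```python
import queue
import threading
import numpy as np
from flask import Flask, jsonify
from tensorrtserver.api import InferContext, ProtocolType

app = Flask(__name__)
work_queue = queue.Queue()

def worker():
    # Each worker owns its own context, so requests from different
    # Flask handlers never serialize on a shared InferContext.
    ctx = InferContext("localhost:8000", ProtocolType.HTTP, "my_model", -1)
    while True:
        data, done = work_queue.get()
        result = ctx.run({"input": [data]},
                         {"output": InferContext.ResultFormat.RAW},
                         batch_size=1)
        done(result)
        work_queue.task_done()

# Worker count is a tuning knob; matching instance_group is one choice.
for _ in range(8):
    threading.Thread(target=worker, daemon=True).start()

@app.route("/infer")
def infer():
    ready = threading.Event()
    holder = {}

    def done(result):
        holder["result"] = result
        ready.set()

    work_queue.put((np.zeros((3, 224, 224), dtype=np.float32), done))
    ready.wait()
    return jsonify({"done": True})  # a real handler would serialize the result
```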