Is InferContext in the Python/C++ API related to CUDA context?
To elaborate my question, I have a few use cases as below:

- For server config, `instance_group=1` and `max_batch_size=8` are set. In the Python client for inference, I instantiate one `InferContext` object and concurrently call `ctx.async_run(batch_size=1)` in eight sub-threads. Can the server batch the eight requests from one context?
- For server config, `instance_group=1` and `max_batch_size=8` are set. In the Python client for inference, I instantiate eight `InferContext` objects and concurrently call `ctx_i.async_run(batch_size=1)` in eight sub-threads. Can the server batch the eight requests from eight different contexts?
- For server config, `instance_group=8` and `max_batch_size=1` are set. In the Python client for inference, I instantiate one `InferContext` object and concurrently call `ctx.async_run(batch_size=1)` in eight sub-threads. Can the server dispatch the eight requests from one context to eight model instances?
- For server config, `instance_group=8` and `max_batch_size=1` are set. In the Python client for inference, I instantiate eight `InferContext` objects and concurrently call `ctx_i.async_run(batch_size=1)` in eight sub-threads. Can the server dispatch the eight requests from eight different contexts to eight model instances?
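For reference, these two knobs live in the model's `config.pbtxt`; in the actual config format, `instance_group` is a block whose `count` field corresponds to the `instance_group=N` shorthand used above. A minimal sketch with placeholder model name and platform:

```
name: "my_model"           # placeholder
platform: "tensorrt_plan"  # placeholder
max_batch_size: 8
instance_group [
  {
    count: 1        # number of model instances, i.e. the "instance_group=1" case
    kind: KIND_GPU
  }
]
```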
For a more specific use case, I have a server built with Flask that receives client requests. To use the Python API to communicate with TRTIS, should I instantiate one `InferContext` and call `ctx.async_run()` for each inference request? Or should I instantiate multiple `InferContext` objects (equal to `instance_group` in the server config) and call `ctx.async_run()` from the different context objects? A minimal sketch of the first pattern follows.
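Below is a minimal sketch of one shared `InferContext` driven from eight threads, batch size 1 each. It assumes the legacy `tensorrtserver.api` client; the exact `async_run` signature varied across TRTIS releases (older releases returned a request id that is collected with `get_async_run_results`, later ones took a callback), and the server URL, model name, tensor names, and shapes are placeholders, not from this issue.

```python
import threading
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# One shared context for all eight threads ("my_model" and the tensor
# names/shapes below are illustrative placeholders).
ctx = InferContext("localhost:8000", ProtocolType.HTTP, "my_model", -1)

def infer_one():
    data = np.zeros((3, 224, 224), dtype=np.float32)  # one batch item
    # Submit asynchronously with batch_size=1; in older client releases
    # async_run returns a request id and the result is fetched below.
    request_id = ctx.async_run({"input": [data]},
                               {"output": InferContext.ResultFormat.RAW},
                               batch_size=1)
    result = ctx.get_async_run_results(request_id, True)

threads = [threading.Thread(target=infer_one) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```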
Thank you.
Top GitHub Comments
Confirming that InferContext has nothing to do with CUDA contexts. InferContext is an abstraction in the client libraries. TRTIS uses a single CUDA context for each GPU device that it controls. You may find the architecture section of the doc interesting: https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/architecture.html
As far as I know, `InferContext` is not related to a CUDA context. For the use case you explained, it might be better to start a new `InferContext` in a separate thread so your client is not blocked from receiving further API requests. Ideally, I would add a queue where your Flask client submits requests, and have multiple workers listening on the queue to process requests using TRTIS for inference.
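A hedged sketch of that queue-plus-workers idea, again assuming the legacy `tensorrtserver.api` client and its synchronous `ctx.run(inputs, outputs, batch_size)`; the endpoint, model name, tensor names, shapes, and worker count are all illustrative, not from this thread:

```python
import queue
import threading
import numpy as np
from flask import Flask, jsonify
from tensorrtserver.api import InferContext, ProtocolType

app = Flask(__name__)
work_queue = queue.Queue()

def worker():
    # Each worker owns its own context, so requests from different
    # Flask handlers never serialize on a shared InferContext.
    ctx = InferContext("localhost:8000", ProtocolType.HTTP, "my_model", -1)
    while True:
        data, done = work_queue.get()
        result = ctx.run({"input": [data]},
                         {"output": InferContext.ResultFormat.RAW},
                         batch_size=1)
        done(result)
        work_queue.task_done()

# Worker count is a tuning knob; matching instance_group is one choice.
for _ in range(8):
    threading.Thread(target=worker, daemon=True).start()

@app.route("/infer")
def infer():
    ready = threading.Event()
    holder = {}

    def done(result):
        holder["result"] = result
        ready.set()

    work_queue.put((np.zeros((3, 224, 224), dtype=np.float32), done))
    ready.wait()
    return jsonify({"done": True})  # a real handler would serialize the result
```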