
Is InferContext in python/c++ api related to cuda context?

See original GitHub issue

Is InferContext in python/c++ api related to cuda context?

To elaborate on my question, I have a few use cases, as below (a rough sketch of the kind of server config I mean follows the list):

  1. For the server config, instance_group=1 and max_batch_size=8 are set. In the Python client, for inference, I instantiate one InferContext object and concurrently call ctx.async_run(batch_size=1) in eight sub-threads. Can the server batch the eight requests from one context?

  2. For the server config, instance_group=1 and max_batch_size=8 are set. In the Python client, for inference, I instantiate eight InferContext objects and concurrently call ctx_i.async_run(batch_size=1) in eight sub-threads. Can the server batch the eight requests from eight different contexts?

  3. For the server config, instance_group=8 and max_batch_size=1 are set. In the Python client, for inference, I instantiate one InferContext object and concurrently call ctx.async_run(batch_size=1) in eight sub-threads. Can the server dispatch the eight requests from one context to eight model instances?

  4. For the server config, instance_group=8 and max_batch_size=1 are set. In the Python client, for inference, I instantiate eight InferContext objects and concurrently call ctx_i.async_run(batch_size=1) in eight sub-threads. Can the server dispatch the eight requests from eight different contexts to eight model instances?
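For reference, a rough sketch of the kind of config.pbtxt I mean for cases 1 and 2 (the model name and platform are placeholders, and the input/output sections are omitted):

    # Sketch only: name/platform are placeholders; inputs/outputs omitted.
    name: "my_model"
    platform: "tensorrt_plan"
    max_batch_size: 8          # cases 3 and 4 would use max_batch_size: 1
    instance_group [
      {
        count: 1               # cases 3 and 4 would use count: 8
        kind: KIND_GPU
      }
    ]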

For a more specific use case, I have a server built with Flask that receives client requests. To use the Python API to communicate with TRTIS, should I instantiate one InferContext and call ctx.async_run() for each inference request? Or should I instantiate multiple InferContext objects (equal to instance_group in the server config) and call ctx.async_run() from the different context objects?

Thank you.

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
deadeyegoodwin commented, Dec 30, 2019

Confirming that InferContext has nothing to do with CUDA contexts. InferContext is an abstraction in the client libraries. TRTIS uses a single CUDA context for each GPU device that it controls. You may find the architecture section of the doc interesting: https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/architecture.html

1 reaction
bezero commented, Dec 26, 2019

As far as I know, InferContext is not related to the CUDA context.

  1. For this, you can set up the dynamic batcher, which will wait for multiple requests for a certain period of time and combine them into one batch (a rough config sketch follows this list).
  2. Same as the first answer.
  3. As far as I know, yes: it will try to run each request on a different instance, based on availability.
  4. Same as the third answer.
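A rough sketch of what enabling the dynamic batcher in the model's config.pbtxt could look like (the numbers here are just example values, not recommendations):

    # Added to the model config so the server can combine requests into a batch.
    dynamic_batching {
      preferred_batch_size: [ 8 ]            # example value
      max_queue_delay_microseconds: 100      # example value: how long to wait to form a batch
    }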

For the use case that you explained, it might be better to start a new InferContext in a separate thread so you don't block your client from receiving further API requests. Ideally, I would add a queue to which your Flask app submits requests, and have multiple workers listening on the queue that process the requests using TRTIS for inference (a minimal sketch of this pattern follows).
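A minimal sketch of that queue-plus-workers idea, assuming eight workers; the actual InferContext / async_run call is left as a placeholder because its exact signature depends on the TRTIS client-library version:

    # Hypothetical sketch of the suggested pattern; the inference call itself
    # is a placeholder because the client API signature varies by release.
    import queue
    import threading

    NUM_WORKERS = 8                 # e.g. match instance_group on the server
    work_queue = queue.Queue()

    def worker():
        # Each worker could create its own InferContext here (it is a
        # client-side abstraction, not a CUDA context), e.g.:
        # ctx = InferContext(url, protocol, model_name, ...)  # version-dependent
        while True:
            item = work_queue.get()
            if item is None:        # shutdown signal
                break
            payload, on_done = item
            result = None           # placeholder for ctx.run(...) / ctx.async_run(...)
            on_done(result)
            work_queue.task_done()

    workers = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
    for t in workers:
        t.start()

    # A Flask view would then do work_queue.put((request_payload, callback))
    # and wait for the callback (or a Future) before returning the response.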

Read more comments on GitHub >

