
Asynchronous web client sending request to triton server

See original GitHub issue

In image_client.py, the asynchronous requests are appended to the async_requests list:

async_requests.append(
    triton_client.async_infer(
        FLAGS.model_name,
        inputs,
        request_id=str(sent_count),
        model_version=FLAGS.model_version,
        outputs=outputs))

Once all the requests have been sent, the client later calls the blocking get_result() on each InferAsyncRequest, which ties up the calling thread until that response arrives:

if FLAGS.async_set:
    # Collect results from the ongoing async requests
    # for HTTP Async requests.
    for async_request in async_requests:
        responses.append(async_request.get_result())

I am trying to implement a web service using the FastAPI framework. Within this service I make a call to the Triton server to run inference for each incoming request. I would like to know whether Triton provides an async/await or callback pattern, so that I can serve multiple requests concurrently while waiting for the inference results of the current request.

I believe the example provided does not really illustrate the power of async, since it ends up calling the blocking get_result().

Hope my question was clear!
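
For context, one way to await the blocking get_result() from an async framework such as FastAPI is to hand it off to a thread pool. The sketch below is a hypothetical illustration only, not an official Triton pattern; it assumes triton_client, inputs, and outputs are built with tritonclient.http exactly as in image_client.py above, and the coroutine name infer_once is made up.

import asyncio

async def infer_once(triton_client, model_name, inputs, outputs):
    # async_infer() only submits the request and returns an InferAsyncRequest.
    async_request = triton_client.async_infer(
        model_name, inputs, outputs=outputs)
    loop = asyncio.get_running_loop()
    # Offload the blocking get_result() to the default thread pool so the
    # event loop can keep serving other requests while this one is in flight.
    return await loop.run_in_executor(None, async_request.get_result)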

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (3 by maintainers)

Top GitHub Comments

8 reactions
fifofonix commented, Jan 6, 2022

Thanks @alexzubiaga. As it happens, as of a few hours ago, I got the async approach working (via gRPC) without needing to use an internal method. Basically, I imported the aio submodule from grpc and used it to establish the channel.

This seems to work well, allowing me to use a producer/consumer paradigm in which a number of consumers results in concurrent gRPC requests. These have been handled without issue for the past several hours, although I haven't yet scaled out to a farm of Triton servers.

Skeleton code:


import asyncio

import grpc
from grpc import aio

from tritonclient.grpc import service_pb2 as predict_pb2
from tritonclient.grpc import service_pb2_grpc as prediction_service_pb2_grpc

# This is called from an asyncio event loop as normal.
async def my_infer(image_np):
    async with aio.insecure_channel("localhost:8001") as channel:
        stub = prediction_service_pb2_grpc.GRPCInferenceServiceStub(channel)
        # Setup the request with input/outputs as normal (not shown to simplify).
        # ...
        response = await stub.ModelInfer(request)
    return response
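
For illustration, a minimal way to fan the coroutine above out into concurrent requests with standard asyncio tooling (images is assumed to be a list of numpy arrays; the main() wrapper is not part of the original skeleton):

async def main(images):
    # Each my_infer() call opens its own aio channel as in the skeleton above,
    # and gather() runs the requests concurrently on the event loop.
    return await asyncio.gather(*(my_infer(image) for image in images))

# asyncio.run(main(images))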

1 reaction
piekey1994 commented, Mar 1, 2022

@fifofonix Thank you. Sadly, I've had to revert to synchronous gRPC inference. I am running the code on Python 3.10.2, so this might be one of the reasons.

@fifofonix When we use the synchronous gRPC client with loop.run_in_executor, or the await stub.ModelInfer(request) approach mentioned above, there are memory leaks. However, this problem does not occur when using loop.run_in_executor with httpclient.
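
For reference, a minimal sketch of the loop.run_in_executor + httpclient combination mentioned above might look like the following; the URL, model name, and the pre-built inputs/outputs (tritonclient.http.InferInput / InferRequestedOutput objects) are assumptions, not details from the comment.

import asyncio
import functools

import tritonclient.http as httpclient

# Synchronous HTTP client shared by the worker threads (assumed URL).
sync_client = httpclient.InferenceServerClient(url="localhost:8000")

def blocking_infer(model_name, inputs, outputs):
    # Plain synchronous round trip; this blocks the worker thread, not the loop.
    return sync_client.infer(model_name, inputs, outputs=outputs)

async def infer_in_thread(model_name, inputs, outputs):
    loop = asyncio.get_running_loop()
    # The default ThreadPoolExecutor keeps the event loop responsive while
    # the HTTP request is in flight.
    return await loop.run_in_executor(
        None, functools.partial(blocking_infer, model_name, inputs, outputs))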

Read more comments on GitHub >

Top Results From Across the Web

NVIDIA Deep Learning Triton Inference Server Documentation
This document provides information about how to set up and run the Triton inference server container, from the prerequisites to running the container....
Read more >
Client in triton_client::client - Rust - Docs.rs
Create a new triton client for the given url. source. pub async fn server_live(&self) -> Result<ServerLiveResponse, Error>. Check ...
Read more >
pyotritonclient - PyPI
A lightweight http client library for communicating with Nvidia Triton Inference Server (with Pyodide support in the browser)
Read more >
Serving Predictions with NVIDIA Triton | Vertex AI
The command might run for several minutes. Prepare payload file for testing prediction requests. To send the container's server a prediction request, prepare ......
Read more >
Use Triton Inference Server with Amazon SageMaker
These containers include NVIDIA Triton Inference Server, ... These scheduling and batching decisions are transparent to the client requesting inference.
Read more >
