Asynchronous web client sending requests to Triton server
In the image_client.py example, asynchronous requests are appended to the async_requests list:
```python
async_requests.append(
    triton_client.async_infer(
        FLAGS.model_name,
        inputs,
        request_id=str(sent_count),
        model_version=FLAGS.model_version,
        outputs=outputs))
```
Once all the requests have been sent, the client collects the results by calling the blocking get_result() on each returned InferAsyncRequest, which blocks the calling thread:
```python
if FLAGS.async_set:
    # Collect results from the ongoing async requests
    # for HTTP Async requests.
    for async_request in async_requests:
        responses.append(async_request.get_result())
```
I am trying to implement a web service using the FastAPI framework. Within this service I make calls to Triton server to run inference for each incoming request. I would like to know whether Triton provides an async/await or callback pattern, so that I can serve multiple requests concurrently while waiting for the inference result of the current request. I believe the example provided does not really illustrate the power of async, since it ends up calling the blocking get_result()?
Hope my question was clear!
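For context, the gRPC Python client's async_infer() takes a completion callback rather than returning a handle, and that callback can be bridged into async/await. Below is a minimal sketch under a few assumptions: a Triton gRPC endpoint at localhost:8001, an already-built list of InferInput objects, and an illustrative helper name (infer_awaitable) that is not part of tritonclient:

```python
import asyncio

import tritonclient.grpc as grpcclient

# Hypothetical endpoint, for illustration only.
client = grpcclient.InferenceServerClient(url="localhost:8001")

async def infer_awaitable(model_name, inputs, outputs=None):
    """Bridge the callback-based async_infer into an awaitable result."""
    loop = asyncio.get_running_loop()
    future = loop.create_future()

    def on_complete(result, error):
        # The callback fires on a tritonclient worker thread, so hand
        # the result back to the event loop thread-safely.
        if error is not None:
            loop.call_soon_threadsafe(future.set_exception, error)
        else:
            loop.call_soon_threadsafe(future.set_result, result)

    client.async_infer(model_name, inputs, on_complete, outputs=outputs)
    return await future
```

A FastAPI endpoint can then simply await infer_awaitable(...), leaving the event loop free to serve other requests while Triton works.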
Top GitHub Comments
Thanks @alexzubiaga. As it happens, as of a few hours ago I got the async approach working (via gRPC) without needing to use an internal method. Basically, I imported the aio submodule from grpc and used it to establish the channel.
This seems to work just fine, allowing me to use a producer/consumer paradigm in which a number of consumer tasks issue concurrent gRPC requests. These have been handled without issue for the past several hours, although I haven't yet scaled out to a farm of Triton servers.
Skeleton code:
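A minimal sketch of what this producer/consumer setup might look like, assuming a grpc.aio channel to a Triton gRPC endpoint at localhost:8001 and the public service_pb2/service_pb2_grpc modules shipped with tritonclient; names such as NUM_CONSUMERS and "my_model" are illustrative, and the request payload is left schematic since inputs depend on the model:

```python
import asyncio

import grpc
from tritonclient.grpc import service_pb2, service_pb2_grpc

NUM_CONSUMERS = 4  # illustrative concurrency level

async def consumer(stub, queue):
    """Pull ModelInferRequest messages off the queue and await the RPC."""
    while True:
        request = await queue.get()
        if request is None:  # sentinel: no more work
            queue.task_done()
            break
        response = await stub.ModelInfer(request)
        # ... hand `response` (a ModelInferResponse) to whoever needs it
        queue.task_done()

async def main():
    async with grpc.aio.insecure_channel("localhost:8001") as channel:
        stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)
        queue = asyncio.Queue(maxsize=32)
        consumers = [asyncio.create_task(consumer(stub, queue))
                     for _ in range(NUM_CONSUMERS)]

        # Producer: build and enqueue requests.
        for i in range(100):
            request = service_pb2.ModelInferRequest()
            request.model_name = "my_model"  # hypothetical model name
            request.id = str(i)
            # ... populate request.inputs / request.raw_input_contents
            await queue.put(request)

        # Shut the consumers down once the queue drains.
        for _ in consumers:
            await queue.put(None)
        await asyncio.gather(*consumers)

asyncio.run(main())
```

With the aio stubs, each in-flight ModelInfer call is just another awaitable, so a handful of consumer tasks is enough to keep many requests concurrently in flight over a single channel.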