Dynamic Batching not creating batches correctly and incorrect inference results
Description
I am deploying a Triton server to GKE via the gke-marketplace-app documentation. When I try to use dynamic batching, the requests are not batched and are only sent with a batch size of 1. Additionally, inference returns only one detection when it should return multiple.
Triton Information
The version is 2.17, as this is what the marketplace feature deploys.
Are you using the Triton container or did you build it yourself?
Deployed via the GCP Marketplace.
To Reproduce
I create the inference server with the following config:
name: "sample"
platform: "pytorch_libtorch"
max_batch_size : 16
input [
{
name: "INPUT__0"
data_type: TYPE_UINT8
format: FORMAT_NCHW
dims: [ 3, 512, 512 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ -1, 4 ]
},
{
name: "OUTPUT__1"
data_type: TYPE_INT64
dims: [ -1 ]
label_filename: "sample.txt"
},
{
name: "OUTPUT__2"
data_type: TYPE_FP32
dims: [ -1 ]
}
]
dynamic_batching {
max_queue_delay_microseconds: 50000
}
I am calling inference as follows:
model = "sample"
client = httpclient.InferenceServerClient( url = url )
input_1 = httpclient.InferInput(name = "INPUT__0", shape = list(data.shape), datatype = "UINT8")
input_2 = httpclient.InferInput(name = "INPUT__0", shape = list(data.shape), datatype = "UINT8")
input_1.set_data_from_numpy(data, binary_data = True)
input_2.set_data_from_numpy(data, binary_data = True)
output_00 = httpclient.InferRequestedOutput(name = "OUTPUT__0", binary_data = False)
output_01 = httpclient.InferRequestedOutput(name = "OUTPUT__1", binary_data = False)
output_02 = httpclient.InferRequestedOutput(name = "OUTPUT__2", binary_data = False)
output_10 = httpclient.InferRequestedOutput(name = "OUTPUT__0", binary_data = False)
output_11 = httpclient.InferRequestedOutput(name = "OUTPUT__1", binary_data = False)
output_12 = httpclient.InferRequestedOutput(name = "OUTPUT__2", binary_data = False)
# Is this correct? I tried using reshape in the config, but it did not work. Without this I get errors about data shape.
input_1.set_shape([1, 3, 512, 512]
input_2.set_shape([1, 3, 512, 512]
response_1 = client.async_infer(model_name = model, inputs = [input_1], outputs = [output_00, output_01, output_02])
response_2 = client.async_infer(model_name = model, inputs = [input_2], outputs = [output_10, output_11, output_12])
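As an aside, a minimal sketch of an alternative to calling set_shape afterwards, assuming data is the same (3, 512, 512) uint8 array described above: add the batch dimension to the array itself so that set_data_from_numpy infers the full [1, 3, 512, 512] shape (the zero-filled array below is only a stand-in for the real image tensor):

import numpy as np
import tritonclient.http as httpclient

# Stand-in for the real image tensor; shape and dtype match the config above.
data = np.zeros((3, 512, 512), dtype=np.uint8)

# Add an explicit batch dimension instead of calling set_shape afterwards.
batched = np.expand_dims(data, axis=0)  # shape becomes (1, 3, 512, 512)

input_1 = httpclient.InferInput(name="INPUT__0", shape=list(batched.shape), datatype="UINT8")
input_1.set_data_from_numpy(batched, binary_data=True)  # shape is taken from the array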
Expected behavior
With the above code, when I run print(response_1.get_result().get_response())
I am only seeing one detection, but I know the model detects multiple objects when run directly on my local machine:
{... [{'name': 'OUTPUT__0', 'datatype': 'FP32', 'shape': [1, 4], 'data': [x_min, y_min, x_max, y_max]}, ...}
Additionally, when I run print(client.get_inference_statistics())
I am seeing only a batch size of 1 when I expect 2 in this case:
{ ... 'batch_stats': [{'batch_size': 1, 'compute_input' : {'count': 2 ...}}] ... }
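As a side note, here is a small sketch (assuming the same url endpoint used above) of how the loaded model configuration and per-model statistics can be inspected to confirm whether the dynamic_batching settings were actually picked up by the server:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url=url)  # same endpoint as above

# The returned config should contain the max_batch_size and dynamic_batching
# settings from config.pbtxt if the server loaded them.
print(client.get_model_config("sample"))

# Per-model statistics; the batch_stats entries show which batch sizes
# were actually executed by the model.
print(client.get_inference_statistics(model_name="sample"))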
Top GitHub Comments
Hi @omrifried,
Thanks for the reference. Do you mind sharing the complete versions of the model configuration and client code? (I saw some pieces of these above, but having complete versions would greatly help save time looking into this, thanks.)
Ticket ref: DLIS-3633
Also note that for the HTTP Python client, you will need to set concurrency on the client in order to send requests concurrently: https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_async_infer_client.py#L55-L58
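To illustrate the point above, a minimal sketch (assuming the same url, inputs, and outputs built in the reproduction code) of creating the client with concurrency greater than 1, so both async_infer requests can be in flight at the same time and the dynamic batcher has a chance to group them:

import tritonclient.http as httpclient

# concurrency > 1 allows multiple async_infer calls to be outstanding at once.
client = httpclient.InferenceServerClient(url=url, concurrency=2)

pending = [
    client.async_infer(model_name="sample", inputs=[input_1],
                       outputs=[output_00, output_01, output_02]),
    client.async_infer(model_name="sample", inputs=[input_2],
                       outputs=[output_10, output_11, output_12]),
]

# Block on the results only after both requests have been issued.
results = [p.get_result() for p in pending]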