Triton inference time extremely slow at scale
Description I am deploying a system via Apache Beam and calling the inference server, which is deployed on GKE, within the pipeline on image data. Initial inference latency is acceptable, but as the pipeline progresses and more data is held in flight (more concurrent calls to the server), inference becomes very slow, averaging around 15 seconds per call.
Triton Information Triton version 2.17 deployed via GCP marketplace
Are you using the Triton container or did you build it yourself? Triton version 2.17 deployed via GCP marketplace
To Reproduce The config for the libtorch model is:
name: "sample"
platform: "pytorch_libtorch"
max_batch_size : 0
input [
{
name: "INPUT__0"
data_type: TYPE_UINT8
format: FORMAT_NCHW
dims: [ 3, 512, 512 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ -1, 4 ]
},
{
name: "OUTPUT__1"
data_type: TYPE_INT64
dims: [ -1 ]
label_filename: "sample.txt"
},
{
name: "OUTPUT__2"
data_type: TYPE_FP32
dims: [ -1 ]
}
]
instance_group [
{
count : 2
kind: KIND_GPU
}
]
The node pool contains 2 nodes, each with its own T4. I am using the gRPC async client within the pipeline to issue inference requests.
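For reference, a minimal sketch of the kind of asynchronous gRPC call made from the pipeline, assuming the Python tritonclient package; the server address is a placeholder, while the input/output names, dtype, and shape follow the config above:

import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder address; the real deployment uses the GKE service endpoint on port 8001.
client = grpcclient.InferenceServerClient(url="triton.example.com:8001")

def infer_async(image: np.ndarray, request_id: str) -> None:
    """Issue one non-blocking inference request for a single CHW uint8 image."""
    # INPUT__0 is TYPE_UINT8 with dims [3, 512, 512]; max_batch_size is 0, so no batch dim.
    infer_input = grpcclient.InferInput("INPUT__0", [3, 512, 512], "UINT8")
    infer_input.set_data_from_numpy(image.astype(np.uint8))

    outputs = [
        grpcclient.InferRequestedOutput("OUTPUT__0"),
        grpcclient.InferRequestedOutput("OUTPUT__1"),
        grpcclient.InferRequestedOutput("OUTPUT__2"),
    ]

    def on_complete(result, error):
        # The callback fires when the server responds; errors surface here, not at call time.
        if error is not None:
            print(f"request {request_id} failed: {error}")
            return
        boxes = result.as_numpy("OUTPUT__0")
        print(f"request {request_id}: {boxes.shape[0]} detections")

    client.async_infer(
        model_name="sample",
        inputs=[infer_input],
        outputs=outputs,
        callback=on_complete,
    )

Because max_batch_size is 0, Triton treats the model as not supporting batching, so each pipeline element becomes its own request and, under load, queues for one of the two model instances rather than being batched.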
Expected behavior I expect inference latency to increase somewhat when the pipeline is holding a lot of data, but not to come anywhere near 15 seconds per request. How can I fix this?
Top GitHub Comments
@tanmayv25 I will try pulling the inference statistics on the next call. I am currently working on a different model that does not use the list-of-tensors input and instead takes a batched input. With that, I am planning to convert the model to TensorRT, since I think the known issues with TorchScript may be the root cause.
I will circle back here accordingly.
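For what it's worth, a minimal sketch of pulling per-model statistics with the Python gRPC client (the address is a placeholder); the queue vs. compute breakdown is usually what shows whether requests are piling up waiting for a free model instance rather than running slowly in the model itself:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton.example.com:8001")

# Cumulative per-model statistics; the field layout follows Triton's statistics
# extension, and with as_json=True the int64 counters come back as strings.
stats = client.get_inference_statistics(model_name="sample", as_json=True)
infer_stats = stats["model_stats"][0]["inference_stats"]

count = int(infer_stats["success"]["count"])
queue_ms = int(infer_stats["queue"]["ns"]) / 1e6
compute_ms = int(infer_stats["compute_infer"]["ns"]) / 1e6
if count:
    print(f"requests: {count}")
    print(f"avg queue time:   {queue_ms / count:.1f} ms")
    print(f"avg compute time: {compute_ms / count:.1f} ms")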
Closing due to lack of activity. Please re-open the issue if you would like to follow up.