
Triton inference time extremely slow at scale

See original GitHub issue

Description

I am deploying a system via Apache Beam and calling the inference server, which is deployed on GKE, from within the pipeline on image data. Initial inference latency is fine, but as the pipeline progresses and more data is in flight (more concurrent calls to the server), inference becomes very slow, averaging around 15 seconds per call.

Triton Information

Triton version 2.17, deployed via the GCP Marketplace.

Are you using the Triton container or did you build it yourself? The Triton container, version 2.17, deployed via the GCP Marketplace.

To Reproduce

The config for the libtorch model is:

name: "sample"
platform: "pytorch_libtorch"
max_batch_size: 0
input [
  {
    name: "INPUT__0"
    data_type: TYPE_UINT8
    format: FORMAT_NCHW
    dims: [ 3, 512, 512 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ -1, 4 ]
  },
  {
    name: "OUTPUT__1"
    data_type: TYPE_INT64
    dims: [ -1 ]
    label_filename: "sample.txt"
  },
  {
    name: "OUTPUT__2"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group [
   {
      count: 2
      kind: KIND_GPU
   }
]
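
As a side note for later context: the config above sets max_batch_size to 0, so requests are never batched and each call occupies one of the two model instances on its own. The comments below mention moving to a model that accepts a batch; purely as an illustration (none of this appears in the original report, and the field values are assumptions), a batched variant of the config could look roughly like this:

name: "sample"
platform: "pytorch_libtorch"
max_batch_size: 8   # illustrative value; requires a TorchScript model traced with a batch dimension
input [
  {
    name: "INPUT__0"
    data_type: TYPE_UINT8
    format: FORMAT_NCHW
    dims: [ 3, 512, 512 ]
  }
]
# output and instance_group sections as in the original config
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}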

The node pool contains two nodes, each with its own T4 GPU. I am using the asynchronous gRPC client within the pipeline to call inference (a sketch of that call pattern follows).
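
For readers unfamiliar with the client side, here is a minimal sketch of this kind of asynchronous gRPC call against the "sample" config above. This is not the author's pipeline code; the endpoint, the dummy image, and the callback are assumptions.

# Minimal sketch of an async gRPC inference call to the "sample" model.
import time
import numpy as np
import tritonclient.grpc as grpcclient

TRITON_URL = "triton.example.internal:8001"   # placeholder gRPC endpoint

def on_result(result, error):
    # Invoked by the client when the async request completes.
    if error is not None:
        print(f"inference failed: {error}")
        return
    boxes = result.as_numpy("OUTPUT__0")    # [-1, 4]
    labels = result.as_numpy("OUTPUT__1")   # [-1]
    scores = result.as_numpy("OUTPUT__2")   # [-1]
    print(boxes.shape, labels.shape, scores.shape)

client = grpcclient.InferenceServerClient(url=TRITON_URL)

# max_batch_size is 0 in the config, so the request shape is exactly [3, 512, 512].
image = np.zeros((3, 512, 512), dtype=np.uint8)   # stand-in for a real image
inp = grpcclient.InferInput("INPUT__0", list(image.shape), "UINT8")
inp.set_data_from_numpy(image)

outputs = [
    grpcclient.InferRequestedOutput("OUTPUT__0"),
    grpcclient.InferRequestedOutput("OUTPUT__1"),
    grpcclient.InferRequestedOutput("OUTPUT__2"),
]

# Non-blocking: the callback fires when the server responds.
client.async_infer(model_name="sample", inputs=[inp], callback=on_result, outputs=outputs)
time.sleep(5)   # crude wait for the callback in this standalone sketch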

Expected behavior

I expect inference time to degrade somewhat when the pipeline has a lot of data in flight, but not to reach anywhere near 15 seconds per call. How can I fix this?

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

1 reaction
omrifried commented, May 7, 2022

@tanmayv25 I will try to pull the inference statistics on the next call (a sketch of that query follows this comment). I am currently working on a different model that does not use the list-of-tensors input and instead takes a batch. With that, I plan to convert the model to TensorRT, since the known issues with TorchScript may be the root cause here.

I will circle back here accordingly.
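
For reference, a minimal sketch (an editorial illustration, not code from the issue) of pulling those per-model statistics with the Python gRPC client might look like this; the endpoint is a placeholder:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton.example.internal:8001")  # placeholder endpoint

# Cumulative request counts and latency breakdowns (queue time vs. compute
# input/infer/output time) for the "sample" model. Comparing queue time to
# compute time helps tell request backlog apart from slow GPU execution.
stats = client.get_inference_statistics(model_name="sample")
print(stats)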

0 reactions
dyastremsky commented, Jun 27, 2022

Closing this issue due to inactivity. Please re-open it if you would like to follow up.

Read more comments on GitHub >
