Avg HTTP time is about 5000 usec higher than Avg request latency
Description
- I deployed the ONNX model of YOLOv5 in Triton and optimized it with the TensorRT accelerator. I have also tested the TensorRT model of YOLOv5 outside Triton, and its inference time is close to the Avg request latency, but the extra time reported as Avg HTTP time is a big problem. I want to know how I can reduce this part of the time.
Triton Information
What version of Triton are you using?
- 21.06.1
Are you using the Triton container or did you build it yourself?
- using the Triton container
To Reproduce
Steps to reproduce the behavior.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
platform: "onnxruntime_onnx"
max_batch_size: 8
default_model_filename: "model.onnx"
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, 8 ]
  }
]
instance_group [
  {
    count: 1
    gpus: [ 0 ]
  }
]
dynamic_batching { }
model_warmup [
  {
    name: "warmup"
    batch_size: 8
    inputs: {
      key: "images"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 512, 480 ]
        random_data: true
      }
    }
  }
]
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
  }]
}}
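For reference, here is a minimal Python HTTP client sketch that exercises this configuration end to end (a sketch only: it assumes the model is served under the name det_onnx used with perf_analyzer below, a batch of 1, the 3x512x480 warmup shape, and that the tritonclient[http] and numpy packages are installed):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One image in a batch of 1, matching the 3x512x480 shape used for warmup.
image = np.random.rand(1, 3, 512, 480).astype(np.float32)

inputs = [httpclient.InferInput("images", [1, 3, 512, 480], "FP32")]
inputs[0].set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("output")]

result = client.infer(model_name="det_onnx", inputs=inputs, outputs=outputs)
print(result.as_numpy("output").shape)

Timing this call on the client and comparing it against the server-side latency reported by Triton shows the same kind of gap that perf_analyzer reports below.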
perf_analyzer -m det_onnx --shape images:3,512,480 --concurrency-range 1 --percentile=95
*** Measurement Settings ***
  Batch size: 1
  Using "time_windows" mode for stabilization
  Measurement window: 5000 msec
  Using synchronous calls for inference
  Stabilizing using p95 latency

Request concurrency: 1
  Client:
    Request count: 217
    Throughput: 43.4 infer/sec
    p50 latency: 23259 usec
    p90 latency: 24276 usec
    p95 latency: 24538 usec
    p99 latency: 24760 usec
    Avg HTTP time: 23013 usec (send/recv 1968 usec + response wait 21045 usec)
  Server:
    Inference count: 261
    Execution count: 261
    Successful request count: 261
    Avg request latency: 16524 usec (overhead 51 usec + queue 91 usec + compute input 503 usec + compute infer 15733 usec + compute output 146 usec)

Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 43.4 infer/sec, latency 24538 usec
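For reference, the gap in this run is Avg HTTP time - Avg request latency = 23013 - 16524 = 6489 usec (about 6.5 ms). Of that, 1968 usec is the client-side send/recv, which leaves roughly 4.5 ms of network transfer and HTTP (de)serialization time that the server-side breakdown never sees.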
Expected behavior
- I expect the request latency of the model deployed with Triton to be close to the standalone inference time of my model.
Issue Analytics
- State:
- Created 2 years ago
- Comments: 11 (5 by maintainers)
Top GitHub Comments
The send/recv is client side timing. There is no way for the client to know when the data gets completely delivered to the server…
<client sends data(T1)><data on network(T2)><server receives data(T3)><server processes data(T4)><server sends data(T5)><data on network(T6)><client receives data(T7)>
send/recv is T1 + T7. The response wait time is T2 + T3 + T4 + T5 + T6. Now, there can be some overlap between T1, T2 and T3, and the same goes for T5, T6 and T7, but they cannot entirely overlap. The larger the data, the smaller the overlap.
XXXX is T2 + T3 + T5 + T6. In the case of shared memory, the data is not transferred over the network but passed within a shared memory region, making the request and response JSON much smaller and hence X smaller as a result. You must use --output-shared-memory-size, described here: https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/main.cc#L532
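As a sketch of how the same perf_analyzer run might look with system shared memory enabled (the 102400-byte output size is only an illustrative upper bound for the [-1, 8] FP32 output and has to be sized for your model):

perf_analyzer -m det_onnx --shape images:3,512,480 --concurrency-range 1 --percentile=95 --shared-memory=system --output-shared-memory-size=102400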
If the client and the server are running on the same machine, you can try using the shared memory and CUDA memory pools. HTTP/GRPC compression might also help reduce the networking time. (cc @tanmayv25 / @GuanLuo to correct me if I’m wrong here).
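As a rough sketch of the compression option on the HTTP path (newer tritonclient.http versions expose request_compression_algorithm / response_compression_algorithm on infer(); check your client version before relying on them):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
image = np.random.rand(1, 3, 512, 480).astype(np.float32)

inputs = [httpclient.InferInput("images", [1, 3, 512, 480], "FP32")]
inputs[0].set_data_from_numpy(image)

result = client.infer(
    model_name="det_onnx",
    inputs=inputs,
    request_compression_algorithm="gzip",   # compress the request body on the way out
    response_compression_algorithm="gzip",  # ask the server to compress the response
)

Compression trades CPU time for bytes on the wire, so it mainly helps when the network, rather than serialization, dominates the send/recv portion.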
You can find example clients for CUDA memory and system shared memory in the links below:
https://github.com/triton-inference-server/client/blob/r21.07/src/python/examples/simple_http_cudashm_client.py
https://github.com/triton-inference-server/client/blob/r21.07/src/python/examples/simple_http_shm_client.py
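Condensed from the simple_http_shm_client.py example above, a sketch of the system-shared-memory flow for this particular model (region names, shm keys and the output byte size are illustrative; the output region must be large enough for the biggest possible [-1, 8] result):

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
from tritonclient import utils

client = httpclient.InferenceServerClient(url="localhost:8000")
client.unregister_system_shared_memory()  # drop any stale registrations

image = np.random.rand(1, 3, 512, 480).astype(np.float32)
input_byte_size = image.size * image.itemsize
output_byte_size = 1024 * 8 * 4  # illustrative upper bound: 1024 rows of 8 FP32 values

# Create the regions and copy the input tensor into its region.
ip_handle = shm.create_shared_memory_region("images_data", "/images_shm", input_byte_size)
shm.set_shared_memory_region(ip_handle, [image])
op_handle = shm.create_shared_memory_region("output_data", "/output_shm", output_byte_size)

# Register both regions with the server.
client.register_system_shared_memory("images_data", "/images_shm", input_byte_size)
client.register_system_shared_memory("output_data", "/output_shm", output_byte_size)

# Point the request at the regions instead of carrying tensors in the HTTP body.
inputs = [httpclient.InferInput("images", [1, 3, 512, 480], "FP32")]
inputs[0].set_shared_memory("images_data", input_byte_size)
outputs = [httpclient.InferRequestedOutput("output")]
outputs[0].set_shared_memory("output_data", output_byte_size)

result = client.infer(model_name="det_onnx", inputs=inputs, outputs=outputs)
out = result.get_output("output")
detections = shm.get_contents_as_numpy(
    op_handle, utils.triton_to_np_dtype(out["datatype"]), out["shape"])

# Clean up the registrations and the regions.
client.unregister_system_shared_memory()
shm.destroy_shared_memory_region(ip_handle)
shm.destroy_shared_memory_region(op_handle)

This only works when client and server share a machine; the request and response bodies then carry only region names and offsets, which is what shrinks the send/recv and response-wait time.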