
dynamic batching not working properly while requests waiting in queue

See original GitHub issue

Description

  • The model has dynamic_batching enabled. max_queue_delay_microseconds is not set, so it defaults to 0. It is a GPU TorchScript model.
  1. Send 5 requests to the model asynchronously.
  2. The model starts executing the first request as soon as it arrives:
I1126 10:15:28.121351 27161 infer_request.cc:547] prepared: [0x0x7ff6380055c0] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638015b38] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638015b38] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0
...
I1126 10:15:28.121456 27161 libtorch.cc:1347] model main, instance main_0, executing 1 requests
  3. While the model is running, the other 4 requests are reported as prepared by infer_request.cc and should be waiting in the queue:
I1126 10:15:28.121460 27161 infer_request.cc:547] prepared: [0x0x7ff638006e70] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638007178] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638007178] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0
I1126 10:15:28.121517 27161 http_server.cc:2727] HTTP request: 2 /v2/models/main/infer
I1126 10:15:28.121526 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121532 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121545 27161 pinned_memory_manager.cc:161] pinned memory allocation: size 48, addr 0x7ff668000090
I1126 10:15:28.121550 27161 infer_request.cc:547] prepared: [0x0x7ff638007610] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638007938] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638007938] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0

I1126 10:15:28.121581 27161 http_server.cc:2727] HTTP request: 2 /v2/models/main/infer
I1126 10:15:28.121589 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121595 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121614 27161 infer_request.cc:547] prepared: [0x0x7ff638007e00] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638008108] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638008108] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0

I1126 10:15:28.121643 27161 http_server.cc:2727] HTTP request: 2 /v2/models/main/infer
I1126 10:15:28.121651 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121658 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121676 27161 infer_request.cc:547] prepared: [0x0x7ff638008be0] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638008f98] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638008f98] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0
  4. The next model execution happens 2.94 s after the last of the 5 requests arrived, which is plenty of time for the backend to collect them into a batch of 4 and send it to the model for execution. However, that is not what we see here: the next execution runs only 1 request, the execution after that also runs only 1 request, and the one after that finally does some batching, with the model running 2 requests.
I1126 10:15:31.066632 27161 libtorch.cc:1347] model main, instance main_0, executing 1 requests
I1126 10:15:31.066639 27161 libtorch.cc:686] TRITONBACKEND_ModelExecute: Running main_0 with 1 requests
...
I1126 10:15:31.170789 27161 libtorch.cc:1347] model main, instance main_0, executing 1 requests
I1126 10:15:31.170796 27161 libtorch.cc:686] TRITONBACKEND_ModelExecute: Running main_0 with 1 requests
...
I1126 10:15:31.275147 27161 libtorch.cc:1347] model main, instance main_0, executing 2 requests
I1126 10:15:31.275154 27161 libtorch.cc:686] TRITONBACKEND_ModelExecute: Running main_0 with 2 requests

Now, if I change max_queue_delay_microseconds to something like 10 ms, a batch does seem to form if the next request arrives within that window, but requests arriving after that window are not batched even though the model is still executing. Shouldn't all requests waiting in the queue while the model executes be submitted together for the next inference?
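
(For reference, the 10 ms experiment corresponds to the following dynamic_batching block in config.pbtxt; the value is given in microseconds, so 10 ms is 10000:)

dynamic_batching {
  max_queue_delay_microseconds: 10000  # 10 ms
}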

Triton Information

We are using the 21.11 NGC Triton image. It is deployed in a Kubernetes cluster and may have gone through some modifications, but should be pretty much the same. We used a V100 GPU, but this issue also seems to happen on a T4 GPU (with the 21.10 Triton image in that case). nvidia-smi shows that I'm running this image with NVIDIA driver 450.102.04 and CUDA version 11.0. The TorchScript model was scripted with PyTorch 1.10.0+cu102.

To Reproduce

I tried to replicate this issue with the simplest model possible. I used the following script to create a TorchScript model.

import os

import torch
import torch.nn as nn


class MainModel(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, inputs):
        # Repeated multiplication just to make the forward pass slow enough
        # that other requests arrive while the model is still executing.
        for _ in range(10000):
            inputs = inputs * 2
        return inputs


# Model repository layout Triton expects: models/<model_name>/<version>/model.pt
os.makedirs('models/main/1', exist_ok=True)
torch.jit.save(torch.jit.script(MainModel()), 'models/main/1/model.pt')

The following config.pbtxt was used:

name: "main"
backend: "pytorch"

input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
instance_group [{
    kind: KIND_GPU
    count: 1
}]
dynamic_batching {
}
max_batch_size: 10000
parameters: [
    {
        key: "DISABLE_OPTIMIZED_EXECUTION"
        value: {
            string_value: "true"
        }
    }
]
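
The client script was not included here, but a minimal sketch of step 1 (sending the 5 requests asynchronously) might look like the following, assuming the tritonclient Python package and a server listening locally on port 8000:

# Hypothetical client sketch (not part of the original report):
# send 5 requests to the "main" model asynchronously over HTTP.
import numpy as np
import tritonclient.http as httpclient

# concurrency > 1 lets several async_infer calls be in flight at once
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=5)

data = np.arange(12, dtype=np.int32).reshape(1, 12)

pending = []
for _ in range(5):
    infer_input = httpclient.InferInput("INPUT__0", [1, 12], "INT32")
    infer_input.set_data_from_numpy(data)
    pending.append(
        client.async_infer(
            model_name="main",
            inputs=[infer_input],
            outputs=[httpclient.InferRequestedOutput("OUTPUT__0")],
        )
    )

# Block until all 5 responses are back
results = [p.get_result() for p in pending]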

Expected behavior

After the first execution with 1 request, the next execution should process 4 requests, since they arrived well before the next model execution and the log says they were prepared.

This issue has been causing a huge performance drop. Could you give us guidance on debugging it? Thank you.

FULL LOG

$ tritonserver --model-repository models --log-verbose 1 --model-control-mode=explicit --load-model main
I1126 10:15:24.232224 27161 metrics.cc:298] Collecting metrics for GPU 0: Tesla V100-SXM2-32GB
I1126 10:15:24.232590 27161 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
I1126 10:15:24.614565 27161 libtorch.cc:1192] TRITONBACKEND_Initialize: pytorch
I1126 10:15:24.614602 27161 libtorch.cc:1202] Triton TRITONBACKEND API version: 1.6
I1126 10:15:24.614612 27161 libtorch.cc:1208] 'pytorch' TRITONBACKEND API version: 1.6
I1126 10:15:24.614661 27161 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so
2021-11-26 19:15:25.059380: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I1126 10:15:25.113301 27161 tensorflow.cc:2170] TRITONBACKEND_Initialize: tensorflow
I1126 10:15:25.113334 27161 tensorflow.cc:2180] Triton TRITONBACKEND API version: 1.6
I1126 10:15:25.113344 27161 tensorflow.cc:2186] 'tensorflow' TRITONBACKEND API version: 1.6
I1126 10:15:25.113353 27161 tensorflow.cc:2210] backend configuration:
{}
I1126 10:15:25.113398 27161 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
I1126 10:15:25.116591 27161 onnxruntime.cc:2157] TRITONBACKEND_Initialize: onnxruntime
I1126 10:15:25.116620 27161 onnxruntime.cc:2167] Triton TRITONBACKEND API version: 1.6
I1126 10:15:25.116630 27161 onnxruntime.cc:2173] 'onnxruntime' TRITONBACKEND API version: 1.6
I1126 10:15:25.129781 27161 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/openvino/libtriton_openvino.so
I1126 10:15:25.146186 27161 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I1126 10:15:25.146208 27161 openvino.cc:1203] Triton TRITONBACKEND API version: 1.6
I1126 10:15:25.146218 27161 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.6
I1126 10:15:25.411769 27161 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7ff668000000' with size 268435456
I1126 10:15:25.412908 27161 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I1126 10:15:25.414399 27161 backend_factory.h:45] Create TritonBackendFactory
I1126 10:15:25.414418 27161 ensemble_backend_factory.cc:47] Create EnsembleBackendFactory
I1126 10:15:25.417917 27161 model_repository_manager.cc:726] AsyncLoad() 'main'
I1126 10:15:25.418211 27161 model_repository_manager.cc:965] TriggerNextAction() 'main' version 1: 1
I1126 10:15:25.418224 27161 model_repository_manager.cc:1003] Load() 'main' version 1
I1126 10:15:25.418233 27161 model_repository_manager.cc:1022] loading: main:1
I1126 10:15:25.518442 27161 model_repository_manager.cc:1082] CreateInferenceBackend() 'main' version 1
I1126 10:15:25.519135 27161 libtorch.cc:1241] TRITONBACKEND_ModelInitialize: main (version 1)
I1126 10:15:25.520296 27161 model_config_utils.cc:1550] ModelConfig 64-bit fields:
I1126 10:15:25.520312 27161 model_config_utils.cc:1552] 	ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I1126 10:15:25.520319 27161 model_config_utils.cc:1552] 	ModelConfig::dynamic_batching::max_queue_delay_microseconds
I1126 10:15:25.520326 27161 model_config_utils.cc:1552] 	ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I1126 10:15:25.520332 27161 model_config_utils.cc:1552] 	ModelConfig::ensemble_scheduling::step::model_version
I1126 10:15:25.520337 27161 model_config_utils.cc:1552] 	ModelConfig::input::dims
I1126 10:15:25.520343 27161 model_config_utils.cc:1552] 	ModelConfig::input::reshape::shape
I1126 10:15:25.520349 27161 model_config_utils.cc:1552] 	ModelConfig::instance_group::secondary_devices::device_id
I1126 10:15:25.520355 27161 model_config_utils.cc:1552] 	ModelConfig::model_warmup::inputs::value::dims
I1126 10:15:25.520361 27161 model_config_utils.cc:1552] 	ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I1126 10:15:25.520367 27161 model_config_utils.cc:1552] 	ModelConfig::optimization::cuda::graph_spec::input::value::dim
I1126 10:15:25.520372 27161 model_config_utils.cc:1552] 	ModelConfig::output::dims
I1126 10:15:25.520378 27161 model_config_utils.cc:1552] 	ModelConfig::output::reshape::shape
I1126 10:15:25.520384 27161 model_config_utils.cc:1552] 	ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I1126 10:15:25.520390 27161 model_config_utils.cc:1552] 	ModelConfig::sequence_batching::max_sequence_idle_microseconds
I1126 10:15:25.520396 27161 model_config_utils.cc:1552] 	ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I1126 10:15:25.520402 27161 model_config_utils.cc:1552] 	ModelConfig::sequence_batching::state::dims
I1126 10:15:25.520407 27161 model_config_utils.cc:1552] 	ModelConfig::version_policy::specific::versions
I1126 10:15:25.520533 27161 libtorch.cc:251] Optimized execution is disabled for model instance 'main'
I1126 10:15:25.520542 27161 libtorch.cc:269] Inference Mode is disabled for model instance 'main'
I1126 10:15:25.520550 27161 libtorch.cc:344] NvFuser is not specified for model instance 'main'
I1126 10:15:25.523385 27161 libtorch.cc:1282] TRITONBACKEND_ModelInstanceInitialize: main_0 (device 0)
I1126 10:15:25.525701 27161 backend_model_instance.cc:105] Creating instance main_0 on GPU 0 (7.0) using artifact 'model.pt'
I1126 10:15:25.546869 27161 triton_model_instance.cc:668] Starting backend thread for main_0 at nice 0 on device 0...
I1126 10:15:25.547031 27161 model_repository_manager.cc:1183] successfully loaded 'main' version 1
I1126 10:15:25.547048 27161 model_repository_manager.cc:965] TriggerNextAction() 'main' version 1: 0
I1126 10:15:25.547055 27161 model_repository_manager.cc:980] no next action, trigger OnComplete()
I1126 10:15:25.547060 27161 dynamic_batch_scheduler.cc:243] Starting dynamic-batcher thread for main at nice 0...
I1126 10:15:25.547079 27161 model_repository_manager.cc:571] VersionStates() 'main'
I1126 10:15:25.547100 27161 model_repository_manager.cc:571] VersionStates() 'main'
I1126 10:15:25.547148 27161 server.cc:522] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1126 10:15:25.547223 27161 server.cc:549] 
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I1126 10:15:25.547250 27161 model_repository_manager.cc:547] BackendStates()
I1126 10:15:25.547289 27161 server.cc:592] 
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
| main  | 1       | READY  |
+-------+---------+--------+

I1126 10:15:25.547424 27161 tritonserver.cc:1920] 
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                      |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                     |
| server_version                   | 2.16.0                                                                                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda |
|                                  | _shared_memory binary_tensor_data statistics                                                                                               |
| model_repository_path[0]         | models                                                                                                                                     |
| model_control_mode               | MODE_EXPLICIT                                                                                                                              |
| startup_models_0                 | main                                                                                                                                       |
| strict_model_config              | 1                                                                                                                                          |
| rate_limit                       | OFF                                                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                  |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                   |
| response_cache_byte_size         | 0                                                                                                                                          |
| min_supported_compute_capability | 6.0                                                                                                                                        |
| strict_readiness                 | 1                                                                                                                                          |
| exit_timeout                     | 30                                                                                                                                         |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+

I1126 10:15:25.547593 27161 grpc_server.cc:4071] === GRPC KeepAlive Options ===
I1126 10:15:25.547606 27161 grpc_server.cc:4072] keepalive_time_ms: 7200000
I1126 10:15:25.547615 27161 grpc_server.cc:4074] keepalive_timeout_ms: 20000
I1126 10:15:25.547624 27161 grpc_server.cc:4076] keepalive_permit_without_calls: 0
I1126 10:15:25.547633 27161 grpc_server.cc:4078] http2_max_pings_without_data: 2
I1126 10:15:25.547642 27161 grpc_server.cc:4080] http2_min_recv_ping_interval_without_data_ms: 300000
I1126 10:15:25.547651 27161 grpc_server.cc:4083] http2_max_ping_strikes: 2
I1126 10:15:25.547659 27161 grpc_server.cc:4085] ==============================
I1126 10:15:25.548640 27161 grpc_server.cc:225] Ready for RPC 'ServerLive', 0
I1126 10:15:25.548666 27161 grpc_server.cc:225] Ready for RPC 'ServerReady', 0
I1126 10:15:25.548677 27161 grpc_server.cc:225] Ready for RPC 'ModelReady', 0
I1126 10:15:25.548685 27161 grpc_server.cc:225] Ready for RPC 'ServerMetadata', 0
I1126 10:15:25.548694 27161 grpc_server.cc:225] Ready for RPC 'ModelMetadata', 0
I1126 10:15:25.548704 27161 grpc_server.cc:225] Ready for RPC 'ModelConfig', 0
I1126 10:15:25.548711 27161 grpc_server.cc:225] Ready for RPC 'ModelStatistics', 0
I1126 10:15:25.548721 27161 grpc_server.cc:225] Ready for RPC 'SystemSharedMemoryStatus', 0
I1126 10:15:25.548731 27161 grpc_server.cc:225] Ready for RPC 'SystemSharedMemoryRegister', 0
I1126 10:15:25.548741 27161 grpc_server.cc:225] Ready for RPC 'SystemSharedMemoryUnregister', 0
I1126 10:15:25.548749 27161 grpc_server.cc:225] Ready for RPC 'CudaSharedMemoryStatus', 0
I1126 10:15:25.548758 27161 grpc_server.cc:225] Ready for RPC 'CudaSharedMemoryRegister', 0
I1126 10:15:25.548768 27161 grpc_server.cc:225] Ready for RPC 'CudaSharedMemoryUnregister', 0
I1126 10:15:25.548777 27161 grpc_server.cc:225] Ready for RPC 'RepositoryIndex', 0
I1126 10:15:25.548785 27161 grpc_server.cc:225] Ready for RPC 'RepositoryModelLoad', 0
I1126 10:15:25.548796 27161 grpc_server.cc:225] Ready for RPC 'RepositoryModelUnload', 0
I1126 10:15:25.548812 27161 grpc_server.cc:416] Thread started for CommonHandler
I1126 10:15:25.548942 27161 grpc_server.cc:3150] New request handler for ModelInferHandler, 1
I1126 10:15:25.548969 27161 grpc_server.cc:2202] Thread started for ModelInferHandler
I1126 10:15:25.549084 27161 grpc_server.cc:3503] New request handler for ModelStreamInferHandler, 3
I1126 10:15:25.549112 27161 grpc_server.cc:2202] Thread started for ModelStreamInferHandler
I1126 10:15:25.549125 27161 grpc_server.cc:4117] Started GRPCInferenceService at 0.0.0.0:8001
I1126 10:15:25.549415 27161 http_server.cc:2815] Started HTTPService at 0.0.0.0:8000
I1126 10:15:25.590519 27161 http_server.cc:167] Started Metrics Service at 0.0.0.0:8002
I1126 10:15:28.121240 27161 http_server.cc:2727] HTTP request: 2 /v2/models/main/infer
I1126 10:15:28.121289 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121301 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121351 27161 infer_request.cc:547] prepared: [0x0x7ff6380055c0] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638015b38] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638015b38] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0

I1126 10:15:28.121413 27161 http_server.cc:2727] HTTP request: 2 /v2/models/main/infer
I1126 10:15:28.121423 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121430 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121456 27161 libtorch.cc:1347] model main, instance main_0, executing 1 requests
I1126 10:15:28.121460 27161 infer_request.cc:547] prepared: [0x0x7ff638006e70] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638007178] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638007178] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0

I1126 10:15:28.121480 27161 libtorch.cc:686] TRITONBACKEND_ModelExecute: Running main_0 with 1 requests
I1126 10:15:28.121517 27161 http_server.cc:2727] HTTP request: 2 /v2/models/main/infer
I1126 10:15:28.121526 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121532 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121545 27161 pinned_memory_manager.cc:161] pinned memory allocation: size 48, addr 0x7ff668000090
I1126 10:15:28.121550 27161 infer_request.cc:547] prepared: [0x0x7ff638007610] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638007938] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638007938] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0

I1126 10:15:28.121581 27161 http_server.cc:2727] HTTP request: 2 /v2/models/main/infer
I1126 10:15:28.121589 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121595 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121614 27161 infer_request.cc:547] prepared: [0x0x7ff638007e00] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638008108] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638008108] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0

I1126 10:15:28.121643 27161 http_server.cc:2727] HTTP request: 2 /v2/models/main/infer
I1126 10:15:28.121651 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121658 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version -1
I1126 10:15:28.121676 27161 infer_request.cc:547] prepared: [0x0x7ff638008be0] request id: , model: main, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7ff638008f98] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
override inputs:
inputs:
[0x0x7ff638008f98] input: INPUT__0, type: INT32, original shape: [1,12], batch + shape: [1,12], shape: [12]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0

I1126 10:15:31.066263 27161 infer_response.cc:165] add response output: output: OUTPUT__0, type: INT32, shape: [1,12]
I1126 10:15:31.066316 27161 http_server.cc:1051] HTTP: unable to provide 'OUTPUT__0' in GPU, will use CPU
I1126 10:15:31.066330 27161 http_server.cc:1071] HTTP using buffer for: 'OUTPUT__0', size: 48, addr: 0x7ff555af3bd0
I1126 10:15:31.066346 27161 pinned_memory_manager.cc:161] pinned memory allocation: size 48, addr 0x7ff6680000d0
I1126 10:15:31.066393 27161 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7ff6680000d0
I1126 10:15:31.066565 27161 http_server.cc:1086] HTTP release: size 48, addr 0x7ff555af3bd0
I1126 10:15:31.066603 27161 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7ff668000090
I1126 10:15:31.066632 27161 libtorch.cc:1347] model main, instance main_0, executing 1 requests
I1126 10:15:31.066639 27161 libtorch.cc:686] TRITONBACKEND_ModelExecute: Running main_0 with 1 requests
I1126 10:15:31.066661 27161 pinned_memory_manager.cc:161] pinned memory allocation: size 48, addr 0x7ff668000090
I1126 10:15:31.170613 27161 infer_response.cc:165] add response output: output: OUTPUT__0, type: INT32, shape: [1,12]
I1126 10:15:31.170637 27161 http_server.cc:1051] HTTP: unable to provide 'OUTPUT__0' in GPU, will use CPU
I1126 10:15:31.170647 27161 http_server.cc:1071] HTTP using buffer for: 'OUTPUT__0', size: 48, addr: 0x7ff555af3bd0
I1126 10:15:31.170656 27161 pinned_memory_manager.cc:161] pinned memory allocation: size 48, addr 0x7ff6680000d0
I1126 10:15:31.170684 27161 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7ff6680000d0
I1126 10:15:31.170752 27161 http_server.cc:1086] HTTP release: size 48, addr 0x7ff555af3bd0
I1126 10:15:31.170770 27161 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7ff668000090
I1126 10:15:31.170789 27161 libtorch.cc:1347] model main, instance main_0, executing 1 requests
I1126 10:15:31.170796 27161 libtorch.cc:686] TRITONBACKEND_ModelExecute: Running main_0 with 1 requests
I1126 10:15:31.170809 27161 pinned_memory_manager.cc:161] pinned memory allocation: size 48, addr 0x7ff668000090
I1126 10:15:31.274920 27161 infer_response.cc:165] add response output: output: OUTPUT__0, type: INT32, shape: [1,12]
I1126 10:15:31.274940 27161 http_server.cc:1051] HTTP: unable to provide 'OUTPUT__0' in GPU, will use CPU
I1126 10:15:31.275019 27161 http_server.cc:1071] HTTP using buffer for: 'OUTPUT__0', size: 48, addr: 0x7ff555af3bd0
I1126 10:15:31.275027 27161 pinned_memory_manager.cc:161] pinned memory allocation: size 48, addr 0x7ff6680000d0
I1126 10:15:31.275052 27161 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7ff6680000d0
I1126 10:15:31.275115 27161 http_server.cc:1086] HTTP release: size 48, addr 0x7ff555af3bd0
I1126 10:15:31.275131 27161 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7ff668000090
I1126 10:15:31.275147 27161 libtorch.cc:1347] model main, instance main_0, executing 2 requests
I1126 10:15:31.275154 27161 libtorch.cc:686] TRITONBACKEND_ModelExecute: Running main_0 with 2 requests
I1126 10:15:31.275167 27161 pinned_memory_manager.cc:161] pinned memory allocation: size 96, addr 0x7ff668000090
I1126 10:15:31.377584 27161 infer_response.cc:165] add response output: output: OUTPUT__0, type: INT32, shape: [1,12]
I1126 10:15:31.377604 27161 http_server.cc:1051] HTTP: unable to provide 'OUTPUT__0' in GPU, will use CPU
I1126 10:15:31.377612 27161 http_server.cc:1071] HTTP using buffer for: 'OUTPUT__0', size: 48, addr: 0x7ff555af3bd0
I1126 10:15:31.377620 27161 infer_response.cc:165] add response output: output: OUTPUT__0, type: INT32, shape: [1,12]
I1126 10:15:31.377627 27161 http_server.cc:1051] HTTP: unable to provide 'OUTPUT__0' in GPU, will use CPU
I1126 10:15:31.377633 27161 http_server.cc:1071] HTTP using buffer for: 'OUTPUT__0', size: 48, addr: 0x7ff658022b40
I1126 10:15:31.377641 27161 pinned_memory_manager.cc:161] pinned memory allocation: size 96, addr 0x7ff668000100
I1126 10:15:31.377665 27161 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7ff668000100
I1126 10:15:31.377729 27161 http_server.cc:1086] HTTP release: size 48, addr 0x7ff555af3bd0
I1126 10:15:31.377762 27161 http_server.cc:1086] HTTP release: size 48, addr 0x7ff658022b40
I1126 10:15:31.377779 27161 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7ff668000090
I1126 10:15:33.111422 27161 http_server.cc:2727] HTTP request: 0 /v2/models/main/stats
I1126 10:15:33.111460 27161 model_repository_manager.cc:571] VersionStates() 'main'
I1126 10:15:33.111476 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version 1
I1126 10:15:33.761620 27161 http_server.cc:2727] HTTP request: 0 /v2/models/main/stats
I1126 10:15:33.761654 27161 model_repository_manager.cc:571] VersionStates() 'main'
I1126 10:15:33.761668 27161 model_repository_manager.cc:615] GetInferenceBackend() 'main' version 1

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 18 (7 by maintainers)

Top GitHub Comments

1 reaction
zhaohb commented, Jan 24, 2022

@tanmayv25 thank you very much. I tested it and found that the performance also improved; it is great!

1 reaction
zhaohb commented, Dec 23, 2021

@tanmayv25 I compared the performance of 21.08 and 21.09, and found that the performance of 21.09 is also worse than 21.08. I think the biggest difference between 21.08 and 21.09 is that TensorRT is built into 21.08, while in 21.09 it is a separate backend.

Read more comments on GitHub >
