Is the dynamic batcher setting successful?
I use an ELECTRA ONNX model to get sentence representations. This is the config.pbtxt:
```
platform: "onnxruntime_onnx"
backend: "onnxruntime"
dynamic_batching {
  preferred_batch_size: [ 4, 8, 32 ]
  max_queue_delay_microseconds: 1000
}
version_policy: {
  latest: {
    num_versions: 1
  }
}
max_batch_size: 100
input: [
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    format: FORMAT_NONE
    dims: [8]
    is_shape_tensor: false
    allow_ragged_batch: false
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    format: FORMAT_NONE
    dims: [8]
    is_shape_tensor: false
    allow_ragged_batch: false
  },
  {
    name: "input_ids"
    data_type: TYPE_INT64
    format: FORMAT_NONE
    dims: [8]
    is_shape_tensor: false
    allow_ragged_batch: false
  }
]
output: [
  {
    name: "output_electra"
    data_type: TYPE_FP32
    dims: [8,256]
    label_filename: ""
    is_shape_tensor: false
  }
]
batch_input: []
batch_output: []
optimization: {
  priority: PRIORITY_DEFAULT
  input_pinned_memory: {
    enable: true
  }
  output_pinned_memory: {
    enable: true
  }
}
instance_group: [
  {
    name: "electra_onnx_model"
    kind: KIND_GPU
    count: 1
    gpus: [0]
    profile: []
  }
]
default_model_filename: "model.onnx"
cc_model_filenames: {}
metric_tags: {}
parameters: {}
model_warmup: []
```
I send HTTP requests to the path /v2/models/electra_onnx_model/infer and get correct responses with input shape [batchsize, 8].
But I am not sure whether the dynamic batcher is configured correctly: when load testing with jmeter, the server seems to execute only one request at a time instead of combining requests into a batch. This is the verbose log for a single request:
```
I0906 07:57:06.702650 1 http_server.cc:1229] HTTP request: 2 /v2/models/electra_onnx_model/infer
I0906 07:57:06.702691 1 model_repository_manager.cc:496] GetInferenceBackend() 'electra_onnx_model' version -1
I0906 07:57:06.702704 1 model_repository_manager.cc:496] GetInferenceBackend() 'electra_onnx_model' version -1
I0906 07:57:06.702763 1 infer_request.cc:502] prepared: [0x0x7f228801d9d0] request id: , model: electra_onnx_model, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f228801f4d8] input: token_type_ids, type: INT64, original shape: [1,8], batch + shape: [1,8], shape: [8]
[0x0x7f228801d828] input: attention_mask, type: INT64, original shape: [1,8], batch + shape: [1,8], shape: [8]
[0x0x7f2288012ca8] input: input_ids, type: INT64, original shape: [1,8], batch + shape: [1,8], shape: [8]
override inputs:
inputs:
[0x0x7f2288012ca8] input: input_ids, type: INT64, original shape: [1,8], batch + shape: [1,8], shape: [8]
[0x0x7f228801d828] input: attention_mask, type: INT64, original shape: [1,8], batch + shape: [1,8], shape: [8]
[0x0x7f228801f4d8] input: token_type_ids, type: INT64, original shape: [1,8], batch + shape: [1,8], shape: [8]
original requested outputs:
output_electra
requested outputs:
output_electra
I0906 07:57:06.702829 1 onnxruntime.cc:1896] model electra_onnx_model, instance electra_onnx_model, executing 1 requests
I0906 07:57:06.702844 1 onnxruntime.cc:940] TRITONBACKEND_ModelExecute: Running electra_onnx_model with 1 requests
I0906 07:57:06.702858 1 pinned_memory_manager.cc:131] pinned memory allocation: size 64, addr 0x7f237e000090
I0906 07:57:06.702876 1 pinned_memory_manager.cc:131] pinned memory allocation: size 64, addr 0x7f237e0000e0
I0906 07:57:06.702884 1 pinned_memory_manager.cc:131] pinned memory allocation: size 64, addr 0x7f237e000130
2021-09-06 07:57:06.703016809 [I:onnxruntime:, sequential_executor.cc:157 Execute] Begin execution
2021-09-06 07:57:06.703764739 [I:onnxruntime:, sequential_executor.cc:469 Execute] [Memory] ExecutionFrame statically allocates 98368 bytes for Cuda
2021-09-06 07:57:06.703774308 [I:onnxruntime:, sequential_executor.cc:469 Execute] [Memory] ExecutionFrame statically allocates 64 bytes for Cpu
2021-09-06 07:57:06.703778428 [I:onnxruntime:, sequential_executor.cc:469 Execute] [Memory] ExecutionFrame statically allocates 64 bytes for CUDA_CPU
2021-09-06 07:57:06.703782619 [I:onnxruntime:, sequential_executor.cc:474 Execute] [Memory] ExecutionFrame dynamically allocates 8192 bytes for Cuda
I0906 07:57:06.704385 1 infer_response.cc:165] add response output: output: output_electra, type: FP32, shape: [1,8,256]
I0906 07:57:06.704400 1 http_server.cc:1200] HTTP using buffer for: 'output_electra', size: 8192, addr: 0x7f22c427a140
I0906 07:57:06.704736 1 http_server.cc:1215] HTTP release: size 8192, addr 0x7f22c427a140
I0906 07:57:06.704752 1 pinned_memory_manager.cc:158] pinned memory deallocation: addr 0x7f237e000090
I0906 07:57:06.704762 1 pinned_memory_manager.cc:158] pinned memory deallocation: addr 0x7f237e0000e0
I0906 07:57:06.704770 1 pinned_memory_manager.cc:158] pinned memory deallocation: addr 0x7f237e000130
```
Is anything wrong with the configuration? How do I set up the dynamic batcher correctly?
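For reference, the load test sends requests roughly like the following minimal Python sketch (using the `requests` library and a thread pool instead of jmeter; the URL assumes Triton's default HTTP port 8000 and the token IDs are placeholder values; the payload follows the KServe v2 HTTP/JSON format that Triton's /infer endpoint accepts):
```
# Minimal sketch: send many concurrent single requests so that several of them
# can arrive inside the dynamic batcher's 1000 us queue-delay window.
# Assumptions: Triton listens on localhost:8000; token IDs are placeholders.
import concurrent.futures
import requests

URL = "http://localhost:8000/v2/models/electra_onnx_model/infer"

# One request with batch dimension 1 (shape [1, 8]); the dynamic batcher is
# what should combine several of these on the server side.
PAYLOAD = {
    "inputs": [
        {"name": "input_ids", "shape": [1, 8], "datatype": "INT64",
         "data": [101, 2023, 2003, 1037, 3231, 6251, 102, 0]},
        {"name": "attention_mask", "shape": [1, 8], "datatype": "INT64",
         "data": [1, 1, 1, 1, 1, 1, 1, 0]},
        {"name": "token_type_ids", "shape": [1, 8], "datatype": "INT64",
         "data": [0, 0, 0, 0, 0, 0, 0, 0]},
    ],
    "outputs": [{"name": "output_electra"}],
}

def send_one(_):
    resp = requests.post(URL, json=PAYLOAD, timeout=10)
    resp.raise_for_status()
    return resp.json()["outputs"][0]["shape"]

# 64 in-flight requests; with enough concurrency the verbose log should show
# "executing N requests" with N > 1.
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    shapes = list(pool.map(send_one, range(256)))

print(shapes[0])  # e.g. [1, 8, 256]
```
Each request carries batch dimension 1, matching the original shape: [1,8] entries in the log above; combining them into larger batches is the dynamic batcher's job on the server side.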
Top GitHub Comments
Are you creating sufficient concurrent requests to the server with jmeter? If the intervals between requests are greater than 1000 µs, there will not be any batching. If you do have sufficient request concurrency, then try increasing the max_queue_delay_microseconds parameter.

Looks like this issue has been resolved. Please re-open if you have more questions.
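One way to confirm whether batches are actually being formed, beyond grepping the verbose log for "executing N requests", is to query the model statistics after a load run. The sketch below is only an assumption-level example: it uses the /v2/models/<model>/stats endpoint from Triton's statistics extension, and the exact JSON field names (model_stats, batch_stats, compute_infer) may differ slightly between Triton versions:
```
# Sketch: check which batch sizes the server has actually executed.
# Assumptions: Triton's statistics extension is enabled (default in standard
# builds) and the server is reachable on the default HTTP port 8000.
import requests

stats = requests.get(
    "http://localhost:8000/v2/models/electra_onnx_model/stats", timeout=10
).json()

for model in stats.get("model_stats", []):
    print("model:", model.get("name"), "version:", model.get("version"))
    for bs in model.get("batch_stats", []):
        count = bs.get("compute_infer", {}).get("count", 0)
        print("  batch_size", bs.get("batch_size"), "executed", count, "times")
```
If every reported entry has batch_size 1, requests are arriving more than max_queue_delay_microseconds apart, so the client concurrency (for example the jmeter thread count) needs to be raised before tuning the delay.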