timeouts when using gpu_execution_accelerator 'tensorrt'
Description
When calling the Triton Inference Server (running in a Docker container, using the Python client), requests time out after 60 seconds several times before coming through almost instantly.
After seeing the same behaviour in versions 21.08, 21.10, and 21.11, reading about changes in scheduling logic in this issue led me to try 21.07, which does not accept the value “tensorrt” as gpu_execution_accelerator. Removing the “optimization” block from the configuration then solved the problem: all requests now come through immediately, also when run in versions 21.08-21.11.
While inference is now slower than expected based on Model Navigator results, at least it works, which makes the issue less pressing. It would of course still be great to be able to use the tensorrt acceleration.
Triton Information
What version of Triton are you using?
nvcr.io/nvidia/tritonserver:{21.07, 21.08, 21.09, 21.10, 21.11}-py3
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
After successful conversion of the model from Torch to TorchScript, I generate Triton configurations using the Model Navigator, then deploy the Triton Inference Server using a Docker container. The fastest configuration is ts2onnx_op13.trt_fp16; the generated configuration is as follows:
name: "dense.ts2onnx_op13.trt_fp16"
max_batch_size: 256
optimization {
execution_accelerators {
gpu_execution_accelerator {
name: "tensorrt"
parameters {
key: "max_workspace_size_bytes"
value: "4294967296"
}
parameters {
key: "precision_mode"
value: "FP16"
}
}
}
}
backend: "onnx"
I’ve adapted the Kubernetes Deployment from the Helm chart to a local Docker command, mounting the ‘final-model-store’ directory from the Model Navigator output:
docker run --rm -it --ipc=host --gpus all \
-p 8000:8000 -p 8001:8001 \
-e MODEL_REPOSITORY_PATH=/mnt/triton-models \
-v $PWD/final-model-store:/mnt/triton-models \
nvcr.io/nvidia/tritonserver:21.11-py3 \
tritonserver \
--model-store=/mnt/triton-models \
--model-control-mode=none \
--strict-model-config=false \
--allow-metrics=false \
--allow-gpu-metrics=false \
--log-verbose=5
I then call the server using the Python client:
import time

import tritonclient.http as triton_client

client = triton_client.InferenceServerClient(url="localhost:8000")
# ...input preparation...
attempts = 1
for i in range(10):
    start = time.time()
    try:
        results = client.infer(model_name, inputs, outputs=outputs, priority=1)
    except Exception as e:
        print(f"Attempt took {time.time() - start} seconds, error was: {e}")
        attempts += 1
        continue
    print(f"Successful inference took {time.time() - start} seconds ({attempts} attempts)")
print(client.get_inference_statistics(model_name=model_name))
# ...output processing...
Client output:
Attempt took 60.06368064880371 seconds, error was: timed out
Attempt took 60.060895681381226 seconds, error was: timed out
Successful inference took 0.005013465881347656 seconds (3 attempts)
Statistics output:
- counts are in multiples of 10, even though the client sends single examples
{'model_stats': [{'batch_stats': [{'batch_size': 1,
'compute_infer': {'count': 10,
'ns': 158330765850},
'compute_input': {'count': 10, 'ns': 559939},
'compute_output': {'count': 10,
'ns': 600146}}],
'execution_count': 10,
'inference_count': 10,
'inference_stats': {'compute_infer': {'count': 10,
'ns': 158330765850},
'compute_input': {'count': 10,
'ns': 559939},
'compute_output': {'count': 10,
'ns': 600146},
'fail': {'count': 0, 'ns': 0},
'queue': {'count': 10,
'ns': 136415637221},
'success': {'count': 10,
'ns': 294747742835}},
'last_inference': 1639406695141,
'name': 'dense.ts2onnx_op13.trt_fp16',
'version': '1'}]}
Verbose Triton Server model loading logs:
I1213 16:19:32.875780 1 autofill.cc:138] TensorFlow SavedModel autofill: Internal: unable to autofill for 'dense.ts2onnx_op13.trt_fp16', unable to find savedmodel directory named 'model.savedmodel'
I1213 16:19:32.875795 1 autofill.cc:151] TensorFlow GraphDef autofill: Internal: unable to autofill for 'dense.ts2onnx_op13.trt_fp16', unable to find graphdef file named 'model.graphdef'
I1213 16:19:32.875809 1 autofill.cc:164] PyTorch autofill: Internal: unable to autofill for 'dense.ts2onnx_op13.trt_fp16', unable to find PyTorch file named 'model.pt'
I1213 16:19:32.875822 1 autofill.cc:196] ONNX autofill: OK:
I1213 16:19:32.875825 1 model_config_utils.cc:666] autofilled config: name: "dense.ts2onnx_op13.trt_fp16"
platform: "onnxruntime_onnx"
max_batch_size: 256
optimization {
execution_accelerators {
gpu_execution_accelerator {
name: "tensorrt"
parameters {
key: "max_workspace_size_bytes"
value: "4294967296"
}
parameters {
key: "precision_mode"
value: "FP16"
}
}
}
}
backend: "onnxruntime"
Verbose Triton Server output (timed out request):
- I’ve looked up the timing cache message, and from https://github.com/NVIDIA/TensorRT/issues/1413 I understand this should not be the issue.
I1213 16:19:42.045585 1 http_server.cc:2727] HTTP request: 2 /v2/models/dense.ts2onnx_op13.trt_fp16/infer
I1213 16:19:42.045636 1 model_repository_manager.cc:615] GetInferenceBackend() 'dense.ts2onnx_op13.trt_fp16' version -1
I1213 16:19:42.045647 1 model_repository_manager.cc:615] GetInferenceBackend() 'dense.ts2onnx_op13.trt_fp16' version -1
I1213 16:19:42.045694 1 infer_request.cc:524] prepared: [0x0x7fb8f8003200] request id: , model: dense.ts2onnx_op13.trt_fp16, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fb8f80037d8] input: INPUT__0, type: FP32, original shape: [1,80,327], batch + shape: [1,80,327], shape: [80,327]
override inputs:
inputs:
[0x0x7fb8f80037d8] input: INPUT__0, type: FP32, original shape: [1,80,327], batch + shape: [1,80,327], shape: [80,327]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0
I1213 16:19:42.045797 1 onnxruntime.cc:2167] model dense.ts2onnx_op13.trt_fp16, instance dense.ts2onnx_op13.trt_fp16, executing 1 requests
I1213 16:19:42.045824 1 onnxruntime.cc:1159] TRITONBACKEND_ModelExecute: Running dense.ts2onnx_op13.trt_fp16 with 1 requests
2021-12-13 16:19:42.047277688 [I:onnxruntime:log, bfc_arena.cc:26 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-12-13 16:19:42.047294088 [V:onnxruntime:log, bfc_arena.cc:62 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-12-13 16:19:42.047346387 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution
2021-12-13 16:19:42.660233702 [W:onnxruntime:log, tensorrt_execution_provider.h:53 log] [2021-12-13 16:19:42 WARNING] Detected invalid timing cache, setup a local cache instead
(1 min no logs, then start of next request)
I1213 16:20:42.105574 1 http_server.cc:2727] HTTP request: 2 /v2/models/dense.ts2onnx_op13.trt_fp16/infer
Other things I tried:
- I first run inference on a single example as warm-up; this one usually times out more often than subsequent calls
- model-control-mode={POLL,none}
- add ‘priority=1’ to client.infer()
- grpc client
- added dynamic_batching config to config.pbtxt
- disabled dynamic_batching and optimized execution as in https://github.com/triton-inference-server/server/issues/3624
# modification 1
dynamic_batching {
  preferred_batch_size: [ 1 ]
  max_queue_delay_microseconds: 1000
  default_queue_policy {
    max_queue_size: 1
  }
}
# modification 2
dynamic_batching {}
parameters: [
  {
    key: "DISABLE_OPTIMIZED_EXECUTION"
    value: {
      string_value: "true"
    }
  }
]
This issue describes similar behaviour; however, it has no solution.
Should any part of the model creation/analysis be useful in debugging, I am happy to share more detail where necessary.
Thanks for the awesome tool and for any help in this matter!
Expected behavior
Requests from the client do not time out.
Top GitHub Comments
A TensorRT engine for dynamic shapes is built for certain opt_shapes. However, the optimization profile in TensorRT does allow users to provide a range of shapes for these dynamic dimensions. Now the question becomes: how does ONNXRT handle the TensorRT EP? Can we provide it a range of values for these dynamic dimensions?
The answer to the first question can be found here:
Looking at this documentation, it does appear the engine will be recreated to include input shapes not yet encountered in the range. @gradient-ascent-ai-lab I am not sure whether this would work or not, as I have not looked into the TensorRT EP implementation in ONNXRT. However, going by the language in their documentation, I think that if you send the first inference with the minimum shape and the second inference with the maximum shape, the other inferences that lie between the min and max shapes should not trigger TRT engine recreation. Can you confirm this?
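To make the suggestion concrete, here is a minimal client-side sketch of such a warm-up pass. It is a sketch only: it assumes INPUT__0 is FP32 with shape [1, 80, length] as in the logs above, that the last dimension is the dynamic one, and it uses hypothetical minimum/maximum lengths of 1 and 1000.

import numpy as np
import tritonclient.http as triton_client

client = triton_client.InferenceServerClient(url="localhost:8000")

def send_warmup_request(length):
    # Zero-filled input covering one end of the assumed dynamic-shape range.
    data = np.zeros((1, 80, length), dtype=np.float32)
    infer_input = triton_client.InferInput("INPUT__0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)
    output = triton_client.InferRequestedOutput("OUTPUT__0")
    client.infer("dense.ts2onnx_op13.trt_fp16", [infer_input], outputs=[output])

# Send the minimum and maximum shapes once, before real traffic, so the
# TensorRT engine is built for the full range instead of being rebuilt
# whenever an unseen shape arrives.
send_warmup_request(1)     # assumed minimum sequence length
send_warmup_request(1000)  # assumed maximum sequence length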
@tanmayv25 Thank you for your fast response. It appears this has solved the issue! 🎉
Not being aware of these dynamics, I see now that I had set the min/opt/max shapes rather low compared to real data: the initial model configuration sets the dynamic dimension shapes as 40/140/300, whereas real data is actually more often in the range of 300-800. I have re-run Model Navigator with new values of 1/500/1000 and have created warmup samples of lengths 1 and 1000. Inferencing these first, I get the following output:
Now that this works, I will look into adding these warmup samples into the configuration to remove the need for the client to send these requests.
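For reference, a sketch of what such a model_warmup block in config.pbtxt might look like, assuming the non-batch dims of INPUT__0 are [80, length] with the last dimension dynamic, and that zero-filled samples of lengths 1 and 1000 cover the range:

model_warmup [
  {
    name: "min_length"
    batch_size: 1
    inputs {
      key: "INPUT__0"
      value: {
        data_type: TYPE_FP32
        dims: [ 80, 1 ]
        zero_data: true
      }
    }
  },
  {
    name: "max_length"
    batch_size: 1
    inputs {
      key: "INPUT__0"
      value: {
        data_type: TYPE_FP32
        dims: [ 80, 1000 ]
        zero_data: true
      }
    }
  }
]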
Thank you both for a fast resolution of this issue!