timeouts when using gpu_execution_accelerator 'tensorrt'
Description
When calling the Triton Inference Server (running in a Docker container, using the Python client), requests time out after 60 seconds several times before coming through almost instantly.
After seeing the same behaviour in versions 21.08, 21.10, and 21.11, reading about changes in scheduling logic in this issue led me to try 21.07, which does not accept the value “tensorrt” as gpu_execution_accelerator. Removing the “optimization” block from the configuration then solved the problem: all requests now come through immediately, also when run in versions 21.08-21.11.
While inference is now slower than expected based on Model Navigator results, at least it works, which makes the issue less pressing. It would of course still be great to be able to use the tensorrt acceleration.
Triton Information
What version of Triton are you using?
nvcr.io/nvidia/tritonserver:{21.07, 21.08, 21.09, 21.10, 21.11}-py3
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
After successful conversion of the model from Torch to TorchScript, I generate Triton configurations using the Model Navigator, then deploy the Triton Inference Server using a Docker container. The fastest configuration is ts2onnx_op13.trt_fp16; the generated configuration is as follows:
name: "dense.ts2onnx_op13.trt_fp16"
max_batch_size: 256
optimization {
execution_accelerators {
gpu_execution_accelerator {
name: "tensorrt"
parameters {
key: "max_workspace_size_bytes"
value: "4294967296"
}
parameters {
key: "precision_mode"
value: "FP16"
}
}
}
}
backend: "onnx"
I’ve adapted the Kubernetes Deployment from the Helm chart to a local Docker command, mounting the ‘final-model-store’ directory from the Model Navigator output:
docker run --rm -it --ipc=host --gpus all \
-p 8000:8000 -p 8001:8001 \
-e MODEL_REPOSITORY_PATH=/mnt/triton-models \
-v $PWD/final-model-store:/mnt/triton-models \
nvcr.io/nvidia/tritonserver:21.11-py3 \
tritonserver \
--model-store=/mnt/triton-models \
--model-control-mode=none \
--strict-model-config=false \
--allow-metrics=false \
--allow-gpu-metrics=false \
--log-verbose=5
I then call the server using the Python client:
import time

import tritonclient.http as triton_client

client = triton_client.InferenceServerClient(url="localhost:8000")
# ...input preparation...
attempts = 1
for i in range(10):
    start = time.time()
    try:
        results = client.infer(model_name, inputs, outputs=outputs, priority=1)
    except Exception as e:
        print(f"Attempt took {time.time() - start} seconds, error was: {e}")
        attempts += 1
        continue
    print(f"Successful inference took {time.time() - start} seconds ({attempts} attempts)")
print(client.get_inference_statistics(model_name=model_name))
# ...output processing...
Client output:
Attempt took 60.06368064880371 seconds, error was: timed out
Attempt took 60.060895681381226 seconds, error was: timed out
Successful inference took 0.005013465881347656 seconds (3 attempts)
Statistics output:
- counts are in multiples of 10, even though the client sends single examples
{'model_stats': [{'batch_stats': [{'batch_size': 1,
'compute_infer': {'count': 10,
'ns': 158330765850},
'compute_input': {'count': 10, 'ns': 559939},
'compute_output': {'count': 10,
'ns': 600146}}],
'execution_count': 10,
'inference_count': 10,
'inference_stats': {'compute_infer': {'count': 10,
'ns': 158330765850},
'compute_input': {'count': 10,
'ns': 559939},
'compute_output': {'count': 10,
'ns': 600146},
'fail': {'count': 0, 'ns': 0},
'queue': {'count': 10,
'ns': 136415637221},
'success': {'count': 10,
'ns': 294747742835}},
'last_inference': 1639406695141,
'name': 'dense.ts2onnx_op13.trt_fp16',
'version': '1'}]}
Verbose Triton Server model loading logs:
I1213 16:19:32.875780 1 autofill.cc:138] TensorFlow SavedModel autofill: Internal: unable to autofill for 'dense.ts2onnx_op13.trt_fp16', unable to find savedmodel directory named 'model.savedmodel'
I1213 16:19:32.875795 1 autofill.cc:151] TensorFlow GraphDef autofill: Internal: unable to autofill for 'dense.ts2onnx_op13.trt_fp16', unable to find graphdef file named 'model.graphdef'
I1213 16:19:32.875809 1 autofill.cc:164] PyTorch autofill: Internal: unable to autofill for 'dense.ts2onnx_op13.trt_fp16', unable to find PyTorch file named 'model.pt'
I1213 16:19:32.875822 1 autofill.cc:196] ONNX autofill: OK:
I1213 16:19:32.875825 1 model_config_utils.cc:666] autofilled config: name: "dense.ts2onnx_op13.trt_fp16"
platform: "onnxruntime_onnx"
max_batch_size: 256
optimization {
execution_accelerators {
gpu_execution_accelerator {
name: "tensorrt"
parameters {
key: "max_workspace_size_bytes"
value: "4294967296"
}
parameters {
key: "precision_mode"
value: "FP16"
}
}
}
}
backend: "onnxruntime"
Verbose Triton Server output (timed out request):
- I’ve looked up the timing cache message, and from https://github.com/NVIDIA/TensorRT/issues/1413 I understand this should not be the issue.
I1213 16:19:42.045585 1 http_server.cc:2727] HTTP request: 2 /v2/models/dense.ts2onnx_op13.trt_fp16/infer
I1213 16:19:42.045636 1 model_repository_manager.cc:615] GetInferenceBackend() 'dense.ts2onnx_op13.trt_fp16' version -1
I1213 16:19:42.045647 1 model_repository_manager.cc:615] GetInferenceBackend() 'dense.ts2onnx_op13.trt_fp16' version -1
I1213 16:19:42.045694 1 infer_request.cc:524] prepared: [0x0x7fb8f8003200] request id: , model: dense.ts2onnx_op13.trt_fp16, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fb8f80037d8] input: INPUT__0, type: FP32, original shape: [1,80,327], batch + shape: [1,80,327], shape: [80,327]
override inputs:
inputs:
[0x0x7fb8f80037d8] input: INPUT__0, type: FP32, original shape: [1,80,327], batch + shape: [1,80,327], shape: [80,327]
original requested outputs:
OUTPUT__0
requested outputs:
OUTPUT__0
I1213 16:19:42.045797 1 onnxruntime.cc:2167] model dense.ts2onnx_op13.trt_fp16, instance dense.ts2onnx_op13.trt_fp16, executing 1 requests
I1213 16:19:42.045824 1 onnxruntime.cc:1159] TRITONBACKEND_ModelExecute: Running dense.ts2onnx_op13.trt_fp16 with 1 requests
2021-12-13 16:19:42.047277688 [I:onnxruntime:log, bfc_arena.cc:26 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2021-12-13 16:19:42.047294088 [V:onnxruntime:log, bfc_arena.cc:62 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2021-12-13 16:19:42.047346387 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution
2021-12-13 16:19:42.660233702 [W:onnxruntime:log, tensorrt_execution_provider.h:53 log] [2021-12-13 16:19:42 WARNING] Detected invalid timing cache, setup a local cache instead
(1 min no logs, then start of next request)
I1213 16:20:42.105574 1 http_server.cc:2727] HTTP request: 2 /v2/models/dense.ts2onnx_op13.trt_fp16/infer
Other things I tried:
- I first run inference on a single example as warm-up; this one usually times out more often than subsequent calls
- model-control-mode={POLL,none}
- add ‘priority=1’ to client.infer()
- grpc client
- added dynamic_batching config to config.pbtxt
- disabled dynamic_batching and optimized execution as in https://github.com/triton-inference-server/server/issues/3624
# modification 1
dynamic_batching {
  preferred_batch_size: [ 1 ]
  max_queue_delay_microseconds: 1000
  default_queue_policy {
    max_queue_size: 1
  }
}
# modification 2
dynamic_batching {}
parameters: [
  {
    key: "DISABLE_OPTIMIZED_EXECUTION"
    value: {
      string_value: "true"
    }
  }
]
This issue describes similar behaviour; however, it has no solution.
Should any part of the model creation/analysis be useful in debugging, I am happy to share more detail where necessary.
Thanks for the awesome tool and for any help in this matter!
Expected behavior
Requests from the client do not time out.
Top GitHub Comments
A TensorRT engine for dynamic shapes is built for certain opt_shapes. However, the optimization profile in TensorRT does allow users to provide a range of shapes for these dynamic dimensions. Now the question becomes: how does ONNXRT handle the TensorRT EP? Can we provide it a range of values for these dynamic dimensions?
The answer to the first question can be found here:
Looking at this documentation, it does appear the engine will be recreated to include input shapes not yet encountered in the range. @gradient-ascent-ai-lab I am not sure whether this would work or not, as I have not looked into the TensorRT EP implementation in ONNXRT. However, going by the language in their documentation, I think that if you send the first inference with the minimum shape and the second inference with the maximum shape, the other inferences that lie between the min and max shapes should not trigger TRT engine recreation. Can you confirm this?
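To make the suggestion concrete, here is a minimal client-side sketch of such a warm-up pass. It is a sketch only: it assumes INPUT__0 is FP32 with shape [1, 80, length] as in the logs above, that the last dimension is the dynamic one, and it uses hypothetical minimum/maximum lengths of 1 and 1000.

import numpy as np
import tritonclient.http as triton_client

client = triton_client.InferenceServerClient(url="localhost:8000")

def send_warmup_request(length):
    # Zero-filled input covering one end of the assumed dynamic-shape range.
    data = np.zeros((1, 80, length), dtype=np.float32)
    infer_input = triton_client.InferInput("INPUT__0", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)
    output = triton_client.InferRequestedOutput("OUTPUT__0")
    client.infer("dense.ts2onnx_op13.trt_fp16", [infer_input], outputs=[output])

# Send the minimum and maximum shapes once, before real traffic, so the
# TensorRT engine is built for the full range instead of being rebuilt
# whenever an unseen shape arrives.
send_warmup_request(1)     # assumed minimum sequence length
send_warmup_request(1000)  # assumed maximum sequence length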
@tanmayv25 Thank you for your fast response. It appears this has solved the issue! 🎉
Not being aware of these dynamics, I see now that I had set the min/opt/max shapes rather low compared to real data: the initial model configuration sets the dynamic dimension shapes as 40/140/300, whereas real data is actually more often in the range of 300-800. I have re-run Model Navigator with new values of 1/500/1000 and have created warmup samples of lengths 1 and 1000. Inferencing these first, I get the following output:
Now that this works, I will look into adding these warmup samples into the configuration to remove the need for the client to send these requests.
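For reference, a sketch of what such a model_warmup block in config.pbtxt might look like, assuming the non-batch dims of INPUT__0 are [80, length] with the last dimension dynamic, and that zero-filled samples of lengths 1 and 1000 cover the range:

model_warmup [
  {
    name: "min_length"
    batch_size: 1
    inputs {
      key: "INPUT__0"
      value: {
        data_type: TYPE_FP32
        dims: [ 80, 1 ]
        zero_data: true
      }
    }
  },
  {
    name: "max_length"
    batch_size: 1
    inputs {
      key: "INPUT__0"
      value: {
        data_type: TYPE_FP32
        dims: [ 80, 1000 ]
        zero_data: true
      }
    }
  }
]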
Thank you both for a fast resolution of this issue!