Server returns broken JSON responses when using TensorRT model config
Description
I have the following model config:
name: "ner"
platform: "onnxruntime_onnx"
max_batch_size: 128
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [ -1 ]
},
{
name: "token_type_ids"
data_type: TYPE_INT64
dims: [ -1 ]
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [ -1 ]
}
]
output [
{
name: "start_logits"
data_type: TYPE_FP32
dims: [ -1, 5 ]
},
{
name: "end_logits"
data_type: TYPE_FP32
dims: [ -1, 5 ]
}
]
optimization { execution_accelerators {
gpu_execution_accelerator : [ {
name : "tensorrt"
parameters { key: "precision_mode" value: "FP16" }
parameters { key: "max_workspace_size_bytes" value: "4073741824" }
}]
}}
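For reference, the configuration Triton actually loaded (including the accelerator settings) can be read back over the HTTP API; a minimal sketch in Python, assuming the server is reachable on localhost:8000:

import requests

# Read back the configuration Triton loaded for the "ner" model
# (GET /v2/models/<name>/config, the model-configuration extension).
resp = requests.get("http://localhost:8000/v2/models/ner/config")
resp.raise_for_status()
print(resp.json().get("optimization"))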
The model loads fine. However, when I make inference requests to it, the server returns a payload that is cut off in the middle, such as:
{
"id": "42",
"model_name": "ner",
"model_version": "1",
"outputs": [
{
"name": "end_logits",
"datatype": "FP32",
"shape": [
2,
8,
5
],
"data": [
Triton Information
What version of Triton are you using?
I am using the Triton image nvcr.io/nvidia/tritonserver:21.11-py3.
To Reproduce
Create a model repository with that config and make a POST request to http://localhost:8000/v2/models/ner/infer, such as:
curl -d '{
"id" : "42",
"inputs": [
{
"name": "input_ids",
"shape": [2, 8],
"datatype":"INT64",
"data": [[1,2,3,4,5,6,7,8],[1,2,3,4,5,6,7,8]]
},
{
"name": "token_type_ids",
"shape": [2,8],
"datatype":"INT64",
"data": [[1,2,3,4,5,6,7,8],[1,2,3,4,5,6,7,8]]
},
{
"name": "attention_mask",
"shape": [2, 8],
"datatype":"INT64",
"data": [[1,2,3,4,5,6,7,8],[1,2,3,4,5,6,7,8]]
}
],
"outputs" : [
{
"name" : "start_logits"
},
{
"name" : "end_logits"
}
]
}' -H "Content-Type: application/json" -X POST http://localhost:8000/v2/models/ner/infer
Expected behavior
The full payload with the correct output tensors. The same model without the TensorRT optimization returns the full payload, with the output tensors as expected. Example:
{
"id":"42",
"model_name":"ner",
"model_version":"1",
"outputs":[
{
"name":"end_logits",
"datatype":"FP32",
"shape":[
2,
8,
5
],
"data":[
-9.151423454284668,
-9.307648658752442,
....
-9.209794044494629,
-9.336723327636719,
-9.302064895629883,
-10.113032341003418,
-9.484901428222657
]
},
{
"name":"start_logits",
"datatype":"FP32",
"shape":[
2,
8,
5
],
"data":[
-8.951006889343262,
-9.178339004516602,
....
-9.090873718261719,
-9.383940696716309,
-9.084102630615235,
-9.8538236618042,
-9.339037895202637
]
}
]
}
Top GitHub Comments
@rmccorm4 can you link the PRs for the fix and close this accordingly?
Thank you all for looking into this. I confirm that the model works as expected using ONNX Runtime. I don't know why TensorRT converts the values to NaNs, and I haven't tried TensorRT on this model other than by changing the Triton server configuration. Should I open an issue in the TensorRT repository instead?
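To narrow down whether the NaNs come from TensorRT itself rather than from Triton, one option is to run the same ONNX model directly through ONNX Runtime's TensorRT execution provider; a minimal sketch, assuming the exported model.onnx file and the tensor names from the config above:

import numpy as np
import onnxruntime as ort

# Run the model with the TensorRT execution provider (falling back to CUDA),
# outside of Triton, and check the outputs for NaN.
sess = ort.InferenceSession(
    "model.onnx",  # path to the exported ONNX model (assumed)
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
feed = {
    name: np.array([[1, 2, 3, 4, 5, 6, 7, 8]] * 2, dtype=np.int64)
    for name in ("input_ids", "token_type_ids", "attention_mask")
}
start_logits, end_logits = sess.run(["start_logits", "end_logits"], feed)
print("start_logits contains NaN:", np.isnan(start_logits).any())
print("end_logits contains NaN:", np.isnan(end_logits).any())

If I remember the option name correctly, the TensorRT provider also accepts options such as trt_fp16_enable, which would mirror the FP16 precision_mode used in the Triton config.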