ONNX TensorRT gives widely different results for fp16 quantized CLIP text embedding
Description
I've compiled CLIP's text embedding into an ONNX model (after some tweaking to make it ONNX-compatible; see https://github.com/openai/CLIP/pull/219) and then quantized it to fp16. When I run the ONNX model with TensorRT optimization, the result Triton returns is far off from what I get by running CLIP's text embedding directly with their code (about a 0.77 norm difference after normalization). Without TensorRT, on both CPU and GPU, I see reasonably close values (<0.01 norm difference), which strongly suggests an issue with TensorRT. Could be related to https://github.com/NVIDIA/TensorRT/issues/1839.
Triton Information
What version of Triton are you using? The r22.02 container.
Are you using the Triton container or did you build it yourself? Using the container.
To Reproduce
Steps to reproduce the behavior:
- Compile the text embedding into an ONNX model:
```python
import clip
import clip.model
import onnx
import torch
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference
from torch import nn


class ClipTextFeatureNet(nn.Module):
    def __init__(self, clip_model: clip.model.CLIP):
        super(ClipTextFeatureNet, self).__init__()
        self.clip_model = clip_model

    def forward(self, text):
        text_encoding = self.clip_model.encode_text(text)
        # L2-normalize along the feature dimension.
        text_features = text_encoding / text_encoding.norm(dim=1, keepdim=True)
        return text_features


output = 'out.onnx'
clip_model, clip_preprocessor = clip.load('ViT-B/32', device='cpu')
feature_net = ClipTextFeatureNet(clip_model)

# Build test input by tokenizing test text input.
with open("test_text/test.txt") as file:
    test_text = file.readlines()
dummy_input = clip.tokenize(test_text)
dummy_input = dummy_input.type(torch.IntTensor)

temp_output = f"{output}_temp"
torch.onnx.export(feature_net,
                  dummy_input,
                  temp_output,
                  export_params=True,
                  input_names=["TEXT_TOKENS"],
                  output_names=["TEXT_EMBEDDING"],
                  opset_version=14,
                  dynamic_axes={
                      "TEXT_TOKENS": {0: "batch_size"},
                      "TEXT_EMBEDDING": {0: "batch_size"},
                  })

# Run symbolic shape inference so the saved model carries concrete shapes.
temp_model = onnx.load(temp_output)
out_mp = SymbolicShapeInference.infer_shapes(temp_model)
onnx.save(out_mp, output)
```
- Quantize to fp16
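The original post does not show the quantization code; a minimal sketch, assuming onnxconverter-common's float16 converter is used (the file names are illustrative):

```python
# Sketch of the fp16 quantization step; the exact tool used in the original
# report is not shown. onnxconverter-common is one common choice.
import onnx
from onnxconverter_common import float16

model = onnx.load("out.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "out_fp16.onnx")
```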
- Compare the original model against the fp16 ONNX model when not running on TensorRT; this test passes.
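A sketch of such a comparison, reusing `feature_net` and `dummy_input` from the export script above (the 0.01 threshold matches the difference reported without TensorRT; the CPU provider is an assumption for simplicity):

```python
# Compare CLIP's native output against the fp16 ONNX model under plain
# ONNX Runtime (no TensorRT).
import numpy as np
import onnxruntime as ort
import torch

session = ort.InferenceSession("out_fp16.onnx",
                               providers=["CPUExecutionProvider"])
onnx_features = session.run(
    ["TEXT_EMBEDDING"], {"TEXT_TOKENS": dummy_input.numpy()})[0]

with torch.no_grad():
    ref_features = feature_net(dummy_input).numpy()

# Without TensorRT the per-row norm difference stays below 0.01.
diff = np.linalg.norm(onnx_features.astype(np.float32) - ref_features, axis=1)
print(diff)
```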
- Build a Triton model repository with the resulting ONNX model and TensorRT optimization enabled (configuration below).
- Issue an infer call and compare the result with CLIP's native result.
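A hypothetical client call against the deployed model (the server URL, and the reuse of `dummy_input` and `ref_features` from the snippets above, are assumptions):

```python
# Query the Triton deployment over HTTP and compare against the native result.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
tokens = dummy_input.numpy()  # int32, shape [batch, 77]

infer_input = httpclient.InferInput("TEXT_TOKENS", tokens.shape, "INT32")
infer_input.set_data_from_numpy(tokens)
result = client.infer("text_embedding", inputs=[infer_input])
triton_features = result.as_numpy("TEXT_EMBEDDING")

# With the TensorRT accelerator enabled, this is where the ~0.77 norm
# difference shows up.
diff = np.linalg.norm(triton_features.astype(np.float32) - ref_features, axis=1)
print(diff)
```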
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
```
# proto-file: model_config.proto
# proto-message: ModelConfig

name: "text_embedding"
platform: "onnxruntime_onnx"
max_batch_size: 16

input [{
  name: "TEXT_TOKENS"
  data_type: TYPE_INT32
  dims: [77]
}]
output [{
  name: "TEXT_EMBEDDING"
  data_type: TYPE_FP16
  dims: [1024]
}]

instance_group {
  kind: KIND_AUTO
}

dynamic_batching {
  max_queue_delay_microseconds: 25
  preferred_batch_size: [1, 4, 16]
}

model_warmup {
  name: "warmup for batch 1"
  batch_size: 1
  inputs {
    key: "TEXT_TOKENS"
    value: {
      data_type: TYPE_INT32
      dims: [77]
      random_data: true
    }
  }
}
model_warmup {
  name: "warmup for batch 4"
  batch_size: 4
  inputs {
    key: "TEXT_TOKENS"
    value: {
      data_type: TYPE_INT32
      dims: [77]
      random_data: true
    }
  }
}
model_warmup {
  name: "warmup for batch 16"
  batch_size: 16
  inputs {
    key: "TEXT_TOKENS"
    value: {
      data_type: TYPE_INT32
      dims: [77]
      random_data: true
    }
  }
}

optimization {
  execution_accelerators {
    gpu_execution_accelerator: [{
      name: "tensorrt"
      parameters [
        {
          key: "precision_mode"
          value: "FP16"
        },
        {
          key: "max_workspace_size_bytes"
          value: "4294967296"
        }
      ]
    }]
  }
}
```
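One way to check whether the fp16 kernels are the culprit (not part of the original report) is to drop the precision_mode parameter so the TensorRT accelerator runs at its default FP32 precision:

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [{
      name: "tensorrt"
      parameters [{
        key: "max_workspace_size_bytes"
        value: "4294967296"
      }]
    }]
  }
}
```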
Expected behavior
Expect the result to be close to what CLIP's original implementation would produce.
Top GitHub Comments
That may be an issue fixed on the ORT side. The ORT version used by Triton is an older one, and there is an ongoing PR (https://github.com/triton-inference-server/server/pull/4169) to upgrade to 1.11.0. Would you mind repeating your experiment once Triton uses ORT 1.11.0?
Closing the issue due to lack of activity. Feel free to re-open it if you would like to follow up.