
ONNX TensorRT gives widely different result for fp16 quantized CLIP text embedding


Description

I’ve compiled a CLIP text embedding model into an ONNX model (after some tweaking to make it ONNX-compatible; see https://github.com/openai/CLIP/pull/219) and then quantized it to fp16. When I run the ONNX model with TensorRT optimization, the result Triton returns is far off from what CLIP’s own text encoder produces directly (roughly a 0.77 norm difference after normalization). Without TensorRT, on both CPU and GPU, the values are reasonably close (< 0.01 norm difference), which strongly suggests an issue with TensorRT. It could be related to https://github.com/NVIDIA/TensorRT/issues/1839.
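
For reference, the norm-difference comparison described above can be computed with a short script along these lines. This is a minimal sketch, not the reporter’s exact test: it assumes the exported out.onnx from the steps below (the fp16 conversion would be used the same way), a plain ONNX Runtime CPU session instead of Triton, and a placeholder prompt "a photo of a cat".

import clip
import numpy as np
import onnxruntime as ort
import torch

# Native CLIP reference embedding (fp32, CPU), L2-normalized.
model, _ = clip.load("ViT-B/32", device="cpu")
tokens = clip.tokenize(["a photo of a cat"])  # placeholder prompt; shape [1, 77], int64
with torch.no_grad():
    ref = model.encode_text(tokens)
    ref = ref / ref.norm(dim=1, keepdim=True)

# Embedding from the exported ONNX model (CPU, no TensorRT).
sess = ort.InferenceSession("out.onnx", providers=["CPUExecutionProvider"])
onnx_out = sess.run(
    ["TEXT_EMBEDDING"],
    {"TEXT_TOKENS": tokens.numpy().astype(np.int32)},
)[0]

# The report sees < 0.01 here, but roughly 0.77 once TensorRT is enabled in Triton.
diff = np.linalg.norm(ref.numpy().astype(np.float32) - onnx_out.astype(np.float32))
print(f"norm difference: {diff:.4f}")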

Triton Information

What version of Triton are you using? The r22.02 container.

Are you using the Triton container or did you build it yourself? Using the container.

To Reproduce

Steps to reproduce the behavior:

  1. Compile the text embedding ONNX model:
import clip
import clip.model
import onnx
import torch
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference
from torch import nn


class ClipTextFeatureNet(nn.Module):

    def __init__(self, clip_model: clip.model.CLIP):
        super(ClipTextFeatureNet, self).__init__()
        self.clip_model = clip_model

    def forward(self, text):
        # Encode tokenized text and L2-normalize so the output is a unit vector.
        text_encoding = self.clip_model.encode_text(text)
        text_features = text_encoding / text_encoding.norm(dim=1, keepdim=True)
        return text_features


# Load CLIP on CPU and wrap it so the export produces normalized text features.
output = 'out.onnx'
clip_model, clip_preprocessor = clip.load('ViT-B/32', device='cpu')
feature_net = ClipTextFeatureNet(clip_model)

# Build test input by tokenizing test text input.
with open("test_text/test.txt") as file:
    test_text = file.readlines()
dummy_input = clip.tokenize(test_text)           # int64 token ids, shape [batch, 77]
dummy_input = dummy_input.type(torch.IntTensor)  # cast to int32 to match the model config

temp_output = f"{output}_temp"
torch.onnx.export(feature_net,
                  dummy_input,
                  temp_output,
                  export_params=True,
                  input_names=["TEXT_TOKENS"],
                  output_names=["TEXT_EMBEDDING"],
                  opset_version=14,
                  dynamic_axes={
                      "TEXT_TOKENS": {
                          0: "batch_size"
                      },
                      "TEXT_EMBEDDING": {
                          0: "batch_size"
                      },
                  })

# Run symbolic shape inference so the saved model carries concrete tensor shapes.
temp_model = onnx.load(temp_output)
out_mp = SymbolicShapeInference.infer_shapes(temp_model)
onnx.save(out_mp, output)
  2. Quantize the exported model to fp16 (a conversion sketch follows this list).
  3. Check that a comparison test between the original model and the fp16 ONNX model passes when not running on TensorRT.
  4. Build a Triton model repository with the resulting ONNX model and TensorRT optimization enabled (a repository layout sketch also follows this list).
  5. Issue an infer call and compare the result with CLIP's native result.
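
The issue does not say which tool performed the fp16 conversion in step 2. One common approach, shown here purely as an assumed sketch, is onnxconverter-common's float16 helper:

import onnx
from onnxconverter_common import float16

# Assumed tooling: convert the fp32 graph produced above to fp16.
# Integer inputs such as TEXT_TOKENS are left untouched; float tensors
# (including TEXT_EMBEDDING) become fp16, matching the Triton config below.
model_fp32 = onnx.load("out.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32)
onnx.save(model_fp16, "model.onnx")

For step 4, a standard Triton model repository layout for this model would be:

model_repository/
└── text_embedding/
    ├── config.pbtxt
    └── 1/
        └── model.onnx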

Describe the models (framework, inputs, outputs) and, ideally, include the model configuration file (if using an ensemble, include its model configuration file as well).

# proto-file: model_config.proto
# proto-message: ModelConfig

name: "text_embedding"
platform: "onnxruntime_onnx"
max_batch_size: 16
input [{
  name: "TEXT_TOKENS"
  data_type: TYPE_INT32
  dims: [77]
}]
output [{
  name: "TEXT_EMBEDDING"
  data_type: TYPE_FP16
  dims: [1024]
}]

instance_group { 
  kind: KIND_AUTO
}
dynamic_batching {
  max_queue_delay_microseconds: 25
  preferred_batch_size: [1,4,16]
}

model_warmup {
  name: "warmup for batch 1"
  batch_size: 1
  inputs {
    key: "TEXT_TOKENS"
    value: {
      data_type: TYPE_INT32
      dims: [77]      
      random_data: True
    }
  }
}


model_warmup {
  name: "warmup for batch 4"
  batch_size: 4
  inputs {
    key: "TEXT_TOKENS"
    value: {
      data_type: TYPE_INT32
      dims: [77]      
      random_data: True
    }
  }
}


model_warmup {
  name: "warmup for batch 16"
  batch_size: 16
  inputs {
    key: "TEXT_TOKENS"
    value: {
      data_type: TYPE_INT32
      dims: [77]      
      random_data: True
    }
  }
}

optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters [
          {
            key: "precision_mode"
            value: "FP16"
          },
          { 
            key: "max_workspace_size_bytes" 
            value: "4294967296" 
          }
        ]
      }
    ]
  }
}
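
For step 5, a client-side call against this configuration might look like the sketch below. It uses the tritonclient HTTP API and assumes the server is reachable at localhost:8000 and the same placeholder prompt is used for the native CLIP reference; it is an illustration of the comparison, not the reporter's exact script.

import clip
import numpy as np
import torch
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # assumed endpoint

# Tokenize a placeholder prompt and cast to int32, as the config expects.
tokens = clip.tokenize(["a photo of a cat"]).numpy().astype(np.int32)  # shape [1, 77]

infer_input = httpclient.InferInput("TEXT_TOKENS", list(tokens.shape), "INT32")
infer_input.set_data_from_numpy(tokens)

response = client.infer("text_embedding", inputs=[infer_input])
triton_embedding = response.as_numpy("TEXT_EMBEDDING")  # fp16, already normalized

# Native CLIP reference for comparison.
model, _ = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    ref = model.encode_text(torch.from_numpy(tokens).long())
    ref = ref / ref.norm(dim=1, keepdim=True)

print("norm difference:", np.linalg.norm(ref.numpy() - triton_embedding.astype(np.float32)))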

Expected behavior

The result should be close to what CLIP's original implementation produces.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
GuanLuo commented, Apr 12, 2022

That may be an issue that was fixed on the ORT side. The ORT version used by Triton is an older one, and there is an ongoing PR (https://github.com/triton-inference-server/server/pull/4169) to upgrade to 1.11.0. Would you mind repeating your experiment once Triton uses ORT 1.11.0?

0 reactions
krishung5 commented, May 18, 2022

Closing the issue due to lack of activity. Feel free to re-open it if you would like to follow up on this.
