Quantization with `transformers.onnx`
Environment info
- `transformers` version: 4.11.3
- Platform: Linux-5.4.0-1059-aws-x86_64-with-glibc2.27
- Python version: 3.9.5
- PyTorch version (GPU?): 1.9.1+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: True
- Using distributed or parallel set-up in script?: False
Who can help
Documentation: @sgugger
Information
Model I am using (XLMRobertaForTokenClassification):
The problem arises when using:
- my own modified scripts
The task I am working on is:
- my own task or dataset (which I cannot share for data security reasons)
Problem
At the moment, we are using the old graph conversion approach (`convert_graph_to_onnx.py`) to export our models to ONNX, and we use the quantized version. Now we would like to switch to the new `transformers.onnx` package, but we are not sure how to use quantization with it (see code example 1 below). The documentation lacks a section on using quantization with the new package. We tried the old quantization method, and it appeared to work: with code example 2 we checked whether the graph contained nodes labelled "quantized" (it did). But when we used the quantized model for inference, the scores dropped massively. My question: is this usage of quantization correct with the new package, or should we wait for an updated version?
Code example 1
```python
from pathlib import Path

from transformers import PreTrainedTokenizer
from transformers.convert_graph_to_onnx import quantize, verify
from transformers.onnx.convert import export, validate_model_outputs
from transformers.onnx.features import FeaturesManager


def onnx_export(
    model_directory: Path,
    model_filepath: Path,
    tokenizer: PreTrainedTokenizer,
    atol: float = 0.0001,
    feature: str = "default",
    opset: int = 12,
    quantize_model: bool = False,
):
    """Export model to ONNX.

    Note
    ----
    Code taken and modified from:
    https://github.com/huggingface/transformers/blob/master/src/transformers/onnx/__main__.py

    Parameters
    ----------
    model_directory : Path
        Path to the model directory.
    model_filepath : Path
        Filepath to save the model to.
    tokenizer : PreTrainedTokenizer
        Pre-trained tokenizer.
    atol : float, optional
        Absolute difference tolerance when validating the model, by default 0.0001.
    feature : str, optional
        Export the model with some additional feature, by default "default".
    opset : int, optional
        ONNX opset to use, by default 12.
    quantize_model : bool, optional
        Quantize the model to run with int8, by default False.

    Raises
    ------
    ValueError
        If parameter 'opset' is not sufficient to export the chosen kind of model.
    """
    if feature not in [
        "default",
        "causal-lm",
        "seq2seq-lm",
        "sequence-classification",
        "token-classification",
        "multiple-choice",
        "question-answering",
    ]:
        feature = "default"

    # Allocate the model
    model = FeaturesManager.get_model_from_feature(feature, model_directory)
    model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(
        model, feature=feature
    )
    onnx_config = model_onnx_config(model.config)

    # Ensure the requested opset is sufficient
    if opset < onnx_config.default_onnx_opset:
        raise ValueError(
            f"Opset {opset} is not sufficient to export {model_kind}. "
            f"At least {onnx_config.default_onnx_opset} is required."
        )

    _, onnx_outputs = export(tokenizer, model, onnx_config, opset, model_filepath)
    validate_model_outputs(onnx_config, tokenizer, model, model_filepath, onnx_outputs, atol)

    if quantize_model:
        quantized_model = quantize(model_filepath)
        verify(quantized_model)
        # Remove the original (float) model
        model_filepath.unlink()
        # Rename the quantized model to the original filename
        quantized_model.rename(str(model_filepath.resolve()))
```
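For context, a minimal invocation of the helper above might look like the sketch below; the model directory, output path and checkpoint are placeholders, and it assumes `onnx_export` is importable from the same module.

```python
from pathlib import Path

from transformers import AutoTokenizer

# Hypothetical paths for illustration only.
model_dir = Path("models/xlm-roberta-ner")
tokenizer = AutoTokenizer.from_pretrained(model_dir)

onnx_export(
    model_directory=model_dir,
    model_filepath=Path("onnx/model.onnx"),
    tokenizer=tokenizer,
    feature="token-classification",
    opset=12,
    quantize_model=True,
)
```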
Code example 2
```python
import onnx

model = onnx.load("model.onnx")
onnx.checker.check_model(model)
print(onnx.helper.printable_graph(model.graph))
```
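As a complement to eyeballing the printed graph, the quantization-related operators can be counted programmatically with the standard `onnx` graph API; this is just a sketch, and the file name is a placeholder.

```python
import onnx

model = onnx.load("model.onnx")

# Dynamic quantization in ONNX Runtime typically introduces operators such as
# DynamicQuantizeLinear, MatMulInteger or QLinearMatMul; count anything that
# looks quantization-related rather than grepping the printed graph.
quant_ops = [
    node.op_type
    for node in model.graph.node
    if "quant" in node.op_type.lower() or "integer" in node.op_type.lower()
]
print(f"{len(quant_ops)} quantization-related nodes found: {sorted(set(quant_ops))}")
```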
Hello! As Lysandre said, optimization features are currently being added to optimum.
That being said, I see one potential reason for the scores dropping: the old graph conversion script has an optimize step, which performs many optimizations on the graph. The resulting graph has a different topology than the one initially converted to ONNX, and quantization is applied to this optimized version. In your code example 1, you are applying quantization directly to the converted ONNX model. One thing you can try is optimizing the converted model (the same way it is done in the old conversion script) and then applying quantization to the optimized version. Not only will the resulting model be faster, it might solve your issue as well.
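A minimal sketch of that suggestion, reusing the `optimize`/`quantize`/`verify` helpers from the old conversion script (the file name is a placeholder, and each helper writes a new file next to its input and returns that path):

```python
from pathlib import Path

from transformers.convert_graph_to_onnx import optimize, quantize, verify

# Model already exported with the new transformers.onnx package (placeholder path).
onnx_path = Path("model.onnx")

# Reproduce the old pipeline's ordering: optimize the exported graph first,
# then quantize the optimized graph rather than the raw export.
optimized_path = optimize(onnx_path)
quantized_path = quantize(optimized_path)

# Sanity-check that both models can be loaded by ONNX Runtime.
verify(optimized_path)
verify(quantized_path)
```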
Hello! The new package does not have a quantization option, as we're moving all performance-optimization features into a separate library whose sole focus is accelerating model performance.
The package is the following: https://github.com/huggingface/optimum
You can find a bit of documentation about the feature here: https://github.com/huggingface/optimum/tree/main/src/optimum/onnxruntime
The docs are currently a work in progress and should improve significantly over the coming weeks/months.
As for the questions regarding the quantization, I will let @michaelbenayoun and @mfuntowicz answer 😃
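For readers landing here later, quantization with optimum's ONNX Runtime integration looks roughly like the sketch below. The API has evolved across releases, so the class and method names used here (`ORTModelForTokenClassification`, `ORTQuantizer`, `AutoQuantizationConfig`) and the checkpoint path are assumptions to be checked against the current optimum documentation.

```python
from optimum.onnxruntime import ORTModelForTokenClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Placeholder checkpoint: export the model to ONNX via optimum, then apply
# dynamic int8 quantization with ONNX Runtime.
model = ORTModelForTokenClassification.from_pretrained(
    "path/to/finetuned-xlm-roberta", export=True
)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx-quantized", quantization_config=qconfig)
```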