
Quantization with `transformers.onnx`

See original GitHub issue

Environment info

  • transformers version: 4.11.3
  • Platform: Linux-5.4.0-1059-aws-x86_64-with-glibc2.27
  • Python version: 3.9.5
  • PyTorch version (GPU?): 1.9.1+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?: False

Who can help

Documentation: @sgugger

Information

Model I am using: XLMRobertaForTokenClassification

The problem arises when using:

  • my own modified scripts

The task I am working on is:

  • my own task or dataset (which I cannot share for data-security reasons)

Problem

At the moment we export our models to ONNX with the old graph-conversion script, convert_graph_to_onnx.py, and we use the quantized version. We would now like to switch to the new transformers.onnx package, but it is unclear how to apply quantization with it (see code example 1 below); the documentation does not cover quantization for the new package. We tried reusing the old quantization helper, and it appeared to work: with code example 2 we confirmed that the graph contains nodes whose names include the phrase “quantized”. However, when we ran inference with the quantized model, the scores dropped massively. My question: is this way of applying quantization correct for the new package, or should we wait for an updated version?

Code example 1

from pathlib import Path

from transformers import PreTrainedTokenizer
from transformers.convert_graph_to_onnx import quantize, verify
from transformers.onnx.convert import export, validate_model_outputs
from transformers.onnx.features import FeaturesManager

def onnx_export(
    model_directory: Path,
    model_filepath: Path,
    tokenizer: PreTrainedTokenizer,
    atol: float = 0.0001,
    feature: str = "default",
    opset: int = 12,
    quantize_model: bool = False,
):
    """Export model to ONNX.

    Note
    ----
    Code taken and modified from:
    https://github.com/huggingface/transformers/blob/master/src/transformers/onnx/__main__.py

    Parameters
    ----------
    model_directory : Path
        Path to model directory
    model_filepath : Path
        Filepath to save model to
    tokenizer : PreTrainedTokenizer
        Pre-trained tokenizer.
    atol : float, optional
        Absolute difference tolerance when validating the model, by default 0.0001
    feature : str, optional
        Export the model with some additional feature, by default "default"
    opset : int, optional
        ONNX opset to use, by default 12
    quantize_model : bool, optional
        Quantize the model to be run with int8, by default False

    Raises
    ------
    ValueError
        If parameter 'opset' is not sufficient to export the chosen kind of model.
    """

    if feature not in [
        "default",
        "causal-lm",
        "seq2seq-lm",
        "sequence-classification",
        "token-classification",
        "multiple-choice",
        "question-answering",
    ]:
        feature = "default"

    # Allocate the model
    model = FeaturesManager.get_model_from_feature(feature, model_directory)
    model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(
        model, feature=feature
    )
    onnx_config = model_onnx_config(model.config)

    # Ensure the requested opset is sufficient
    if opset < onnx_config.default_onnx_opset:
        raise ValueError(
            f"Opset {opset} is not sufficient to export {model_kind}. "
            f"At least  {onnx_config.default_onnx_opset} is required."
        )

    _, onnx_outputs = export(tokenizer, model, onnx_config, opset, model_filepath)

    validate_model_outputs(onnx_config, tokenizer, model, model_filepath, onnx_outputs, atol)

    if quantize_model:
        quantized_model = quantize(model_filepath)
        verify(quantized_model)

        # remove the original model
        model_filepath.unlink()

        # rename quantized model
        quantized_model.rename(str(model_filepath.resolve()))
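
For context, a hypothetical call to the helper above (the model path and tokenizer are placeholders, not taken from the original issue):

from pathlib import Path

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")

# Export the fine-tuned model, validate the exported graph, and (optionally) quantize it in place.
onnx_export(
    model_directory=Path("path/to/model"),
    model_filepath=Path("model.onnx"),
    tokenizer=tokenizer,
    feature="token-classification",
    opset=12,
    quantize_model=True,
)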

Code example 2

import onnx
model = onnx.load("model.onnx")
onnx.checker.check_model(model)
print(onnx.helper.printable_graph(model.graph))
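
To check for quantized operators without reading the printed graph by hand, the node op types can be scanned directly. This is a sketch; the op names below are the ones ONNX Runtime dynamic quantization typically inserts, and the exact set may differ:

import onnx

model = onnx.load("model.onnx")

# Operator types commonly introduced by dynamic (int8) quantization
quant_ops = {"QuantizeLinear", "DequantizeLinear", "DynamicQuantizeLinear", "MatMulInteger", "QLinearMatMul"}
found = sorted({node.op_type for node in model.graph.node if node.op_type in quant_ops})
print("Quantized operators found:", found if found else "none")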

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 5
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

4 reactions
michaelbenayoun commented, Nov 17, 2021

Hello! As Lysandre said, optimization features are currently being added to optimum.

That being said, I see one potential reason for the scores dropping: the old graph-conversion script has an optimize step that performs many optimizations on the graph. The resulting graph has a different topology from the one initially converted to ONNX, and quantization is applied to that optimized version. In your code example 1, you are applying quantization directly to the converted ONNX model, so one thing you can try is optimizing the converted model (the same way it is done in the old conversion script) and then applying quantization to the optimized version. Not only will the resulting model be faster, it might solve your issue as well.
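
A minimal sketch of the optimize-then-quantize flow described above, assuming the optimize, quantize, and verify helpers from transformers.convert_graph_to_onnx (names as in the 4.x source; the model path is a placeholder):

from pathlib import Path

from transformers.convert_graph_to_onnx import optimize, quantize, verify

model_path = Path("model.onnx")  # path to the model exported with transformers.onnx

# Apply the ONNX Runtime graph optimizations first, as the old conversion script does,
# then quantize the optimized graph instead of the raw export.
optimized_path = optimize(model_path)
quantized_path = quantize(optimized_path)

# Check that both models still load correctly in onnxruntime.
verify(optimized_path)
verify(quantized_path)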

4 reactions
LysandreJik commented, Nov 16, 2021

Hello! The new package does not have a quantization option, as we’re moving all performance-optimization features into a separate library whose sole focus is accelerating model performance.

The package is the following: https://github.com/huggingface/optimum

You can find a bit of documentation about the feature here: https://github.com/huggingface/optimum/tree/main/src/optimum/onnxruntime

The docs are currently a work in progress and should improve significantly over the coming weeks/months.

As for the questions regarding the quantization, I will let @michaelbenayoun and @mfuntowicz answer 😃
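
For reference, a rough sketch of dynamic quantization with optimum’s ONNX Runtime integration. The class and method names below come from later optimum releases, and the API was still evolving at the time of this issue, so treat this as an assumption and check the optimum documentation for your version (the model name is a placeholder):

from optimum.onnxruntime import ORTModelForTokenClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the model to ONNX via optimum
ort_model = ORTModelForTokenClassification.from_pretrained("xlm-roberta-base", export=True)

# Dynamic int8 quantization targeting AVX512-VNNI CPUs
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer = ORTQuantizer.from_pretrained(ort_model)
quantizer.quantize(save_dir="onnx-quantized", quantization_config=qconfig)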
